Preface



The Georgetown University Round Table on Languages and Linguistics (GURT) has 
been held since 1949, placing it among the most long-standing language and linguis- 
tics conferences in the United States. GURT began as a small gathering for research- 
ers in language studies to share their current work and has gradually grown to become 
an internationally known forum for linguistic and language research, with an annual 
thematic focus. The theme of the 2010 Round Table was Arabic Language and Lin- 
guistics. At Georgetown and around the world, students are flocking to courses on 
Modern Standard Arabic and on Arabic linguistics. Arabic, one of the official lan- 
guages of the United Nations, is spoken by more than half a billion people around 
the world and is of increasing importance in political and economic spheres. In ad- 
dition the study of the Arabic language has a long and rich history: The earliest 
grammatical accounts date from the eighth century, and they include full syntactic, 
morphological, and phonological analyses of the vernaculars and of Classical Arabic. 

GURT 2010 was cohosted by the Department of Linguistics and by the Depart- 
ment of Arabic Language and Islamic Studies and was held March 12-14, with 
scholars of Arabic from around the world presenting research on various aspects of 
Arabic language study, from grammatical analysis to language pedagogy, and from 
sociolinguistic investigation to computational analysis. The invited speakers, whose 
work spanned this spectrum, were Mushira Eid, University of Utah; Ali Farghaly,
formerly of Monterey Institute of International Studies and Cairo University, now at 
DataFlux Corporation; Catherine Miller, French Council of Research and Centre 
Jacques Berque; Karin Ryding, Georgetown University; and Yasir Suleiman, Univer- 
sity of Cambridge. GURT 2010 drew more than 200 attendees from around the world, 
including many from Europe, Asia, North Africa, and the Middle East, making it the 
largest assemblage of Arabic language scholars in North America to date. Nearly 
eighty papers were presented in the main sessions, and another two dozen were 
presented in panel sessions. This volume includes a selection of the papers presented 
that represents the range and quality of research on the Arabic language in the twenty- 
first century. 

We would like to express our gratitude to the Faculty of Languages and Linguis- 
tics and the Department of Linguistics for their financial support and to the Depart- 
ment of Arabic Language and Islamic Studies for its institutional support. We are 
grateful to our reviewers for sharing their expertise in the field and for helping us to 
select papers both for the conference and for this volume. We are also deeply indebted
to our chief graduate organizer, Cala Zubair, whose attention to the countless details 
of organization was invaluable to the success of the conference, and to our assistant 
graduate organizer, Jaemyung Goo, who coordinated an army of graduate student 
assistants and provided essential technical assistance. We also thank the army of 
graduate students, without whom the conference would have been impossible. Our 
thanks also go to both Manela Dies and Meriem Tikue, who provided crucial
administrative assistance before, during, and after the conference. Finally, we would like
to thank Mark Muehlhaeusler of the Lauinger Library of Georgetown University for 
his invaluable help in revising and formatting the texts, the Arabic transliterations, 
and the references for the chapters presented here. 

As we prepared this volume to go to press, contributors and editors alike were 
distracted by the political changes sweeping across the Arab world. We hope that 
these changes bring new opportunities and inspiration for research on the Arabic 
language in the context of more open and democratic societies. 
Table 1.1
The feminine ending is -a in pause and -at in construct forms.

The adjectival ending is transcribed as -ī (masc.) and -īya (fem.), respectively.

The vowels in Standard Arabic are represented as a, i, u, ā, ī, ū, whereas e and
o are used in transcribing dialects. Additional IPA symbols may be used where a 
detailed phonological transcription is necessary. 

Because ʃ, θ, and ð are not used in library catalogues, proper names and book
titles are transliterated according to the Library of Congress Romanization system, 
in order to facilitate the identification of authors and their works. 
This volume collects fifteen papers that represent the state of the art of research on
the Arabic language in its many forms. Part I of the book, the first seven chapters, 
describes aspects of the Arabic language from a theoretical point of view, including 
computational linguistics, syntax, semantics, and historical linguistics. Part II, the 
remaining eight chapters, describes Arabic applied linguistics, sociolinguistics, and 
discourse analysis. Within each part, the chapters are ordered alphabetically by author. 

Part I starts with a discussion of syntax in chapter 1, in which Nizha Chatar-Moumni
addresses negation in Moroccan Arabic, which is marked by the two elements
ma- and -ʃ. Chatar-Moumni explores the historical roots of ma-...-ʃ, drawing
analogies with French negation and the Jespersen cycle, and noting that the final
element -ʃ is often deleted. She argues that the initial negative element ma- is associated
with an indefinite quantifier, whereas the presence or absence of the final negative
marker -ʃ is correlated with the definiteness of the verbal arguments in its scope,
as well as with the pragmatic force of the negation. This she takes to be syntactically 
marked with the [+undefined] feature. 

In chapter 2 Kamel A. Elsaadany and Salwa Muhammed Shams provide an 
analysis of the “floating-quantifier” construction in Arabic. They argue against both 
a transformational analysis and an adverbial analysis, providing considerable evi- 
dence that the “floated” quantifier has different scopal and interpretive properties 
than nonfloated quantifiers. Adopting a lexical-functional approach, Elsaadany and 
Shams suggest that the Arabic floated quantifier contains an anaphoric element that 
is bound by a topicalized noun phrase. The TOPIC function is identified by its ana- 
phoric properties, satisfying the extended coherence condition of lexical-functional 
grammar. 

In chapter 3 Ali Farghaly addresses the subfield of Arabic natural language 
processing. The computational analysis of the Arabic language is a domain in which 
research has blossomed in recent years. Farghaly describes particular problems that 
Arabic poses for computational treatment and provides interesting insights into the 
history of Arabic-English machine translation efforts. He charts the relationship of 
linguistic theory to computational analysis and compares traditional rule-based, sym- 
bolic approaches to natural language processing with newer statistical approaches. 
He concludes by arguing that rule-based approaches, in promoting rigorous analysis
of the Arabic language, meet important sociocultural goals while also fulfilling the 
needs of the Arabic information processing community. 

In chapter 4 Youssef A. Haddad discusses the syntax of “raising” in Modern 
Standard Arabic, which has a class of subject-to-subject raising verbs known as verbs 
of appropinquation, or verbs of beginning (الشروع), hope, and proximity
(المقاربة). Haddad contends that there are three types of raising structures: (1) forward
raising, in which the subject is displaced from the embedded to the matrix clause; 
(2) backward raising, in which the subject is pronounced in the embedded clause but 
bears a structural relationship to the matrix clause; and (3) nonraising, in which the
subject appears only in the embedded clause. He presents a wide range of data argu- 
ing for this three-way distinction, and provides a syntactic analysis of the structures 
within the framework of the Copy-plus-Merge Theory of Movement. 

In chapter 5 Sarah Ouwayda takes up the question of the interpretation of the 
construct state nominal or idaafa construction. This construction has received sig- 
nificant attention in recent years, with controversy over whether it is a referential 
(individual-denoting) expression or a predicational (property-denoting) expression. 
Ouwayda provides compelling arguments that construct state nominals are predica- 
tional expressions. She bases her argument on the interpretation of modified and 
quantified construct state nominals in Lebanese Arabic, showing that only a property-denoting
expression (of the type ⟨e,t⟩) can provide the appropriate interpretations
in a compositional manner. 

The syntax and semantics of interrogative elements in Egyptian Arabic is the 
focus of chapter 6, in which Usama Soltan argues that elements such as mīn (“who”),
which may occur in both in situ and ex situ (or fronted) positions, do not undergo A-bar
movement. He shows that neither in situ nor ex situ wh-expressions evidence island 
sensitivity or intervention effects. He argues for a uniform nontransformational syn- 
tactic analysis, on which wh-expressions receive their scope via association with an 
interrogative null operator. This operator can unselectively bind a wh-phrase either 
in an argument (in situ) position or in a focused (ex situ) position. The focused wh-element
associates with a resumptive pronoun. Soltan’s analysis of displaced mīn
nicely parallels Elsaadany and Shams’s account of floated quantification.

In chapter 7 the focus turns to the history of grammatical analysis. Hana Zabara 
discusses the Arabic copular verb kāna, which assigns accusative case to its object
and nominative case to its subject but is semantically empty of predicative content. 
As Zabara shows, this class of verbs has puzzled Arabic grammarians since the ear- 
liest days of Arabic grammatical analysis. The conflict between the grammatical 
notion of a verb as case marker and the semantic notion of a verb as an element that 
involves predication has led to a complex debate about the nature of these verbs, 
which have become known as “ʾafʿāl nāqiṣa.” Zabara describes the problem these
verbs present and provides an overview of the status of these verbs as seen by seven
grammarians from the eighth to the twelfth centuries, examining how the grammatical
concepts used to describe these verbs develop toward the distinction between complete
verbs (those with verbal grammar and semantics) and incomplete verbs (those
with only verbal grammar). 

Part II begins with chapter 8, in which Reem Bassiouney examines assertiveness 
techniques used by both women and men on Egyptian talk shows. These techniques 
include interruption and floor controlling. The data consist of fifteen hours of talk
shows, and the analysis covers five shows, two of which are exclusive to one group,
either men or women. All the participants are in the same age group, forty-five
to fifty-five years. Talk shows provide an opportunity for women to compete with 
men on a professional level and to redefine their identity according to context. 
Bassiouney argues against the assumption that women in general are more polite than
men and are concerned with solidarity, whereas men are concerned with power,
because power is in fact context dependent. 

In chapter 9 Elena Canna investigates the system of politeness in Morocco, 
where the three autochthonous languages (Moroccan Arabic, Modern Standard Arabic,
and Berber) coexist with the ex-colonizers’ languages, Spanish and above all
French. Canna shows how and to what extent a foreign language such as French 
extends its power to the lowest stratum of this society and therefore how much its 
power over the other languages influences the global politeness system. 

In chapter 10 Ahmed Fakhri uses a genre analysis approach to investigate the 
functions of nominalization in Arabic legal discourse. He shows that nominalization 
serves to strengthen arguments presented in court judgments, allows for comprehen- 
sive enumeration of human conduct and activities targeted by legislative provisions, 
and enhances the affective appeal of fatwas by capturing in nominal form the content 
of Qur’anic verses or Hadith.

In chapter 11 Gunvor Mejdell provides a discussion on the status of research on 
luġa wusṭā, the intermediate forms of Arabic. The aims of this study are twofold:
(1) to define intermediate varieties or levels in terms of features and variants that 
characterize them; and (2) to establish rules for, or constraints on, the kind of com- 
binations of features and variants from the two basic codes that may or may not 
occur. The chapter highlights the psycholinguistic asymmetry between the two basic 
varieties, and the unequal markedness value of different features. 

In chapter 12 Catherine Miller describes the impact that the first broadcasting 
of foreign series dubbed into Moroccan Arabic had on the Moroccan sociolinguistic 
setting. She analyzes the various comments raised by this experience and shows how 
the dubbing process highlights the complexity of the Moroccan linguistic situation. 
The process involves a number of social actors and raises many sociolinguistic issues: 
Does dubbing participate in the promotion of Moroccan Arabic? Which varieties or 
which levels of Moroccan Arabic are selected, by whom, and why? Is there anything 
like a common agreed-upon standard Moroccan variety? And does Casablanca play 
a leading role in this standardization process? 

In chapter 13 Karin Christina Ryding proposes a framework for examining the 
effectiveness of Arabic learning and teaching processes from a critical thinking per- 
spective in order to identify key elements of analytical thinking that play into 
cognitive development. She identifies intellectual challenges and conceptual de- 
mands, and presents curricula that explicitly include the cognitive underpinnings of 
language structure and performance in communicative frameworks of language 
teaching. 

In chapter 14 Yasir Suleiman addresses the ideological — rather than the linguis- 
tic — content of the process of standardization and its contextual elaborations in gram- 
mar making in the Arabic linguistic tradition in the first four centuries of Islam, 
because this period witnessed the production of the first grammars of Arabic. It is 
also a period that is characterized by sociopolitical fault lines that affect grammar 
making directly. This approach helps linguists to study the ideological issues associ- 
ated with standardization in more general terms. Suleiman argues that the search for 
uniformity, correctness, purity, and identity in standardization as an ideology or discursive
project is at the core of grammar making in the Arabic linguistic tradition,
providing it with its sociopolitical and moral/ethical underpinnings. 

Finally, in chapter 15 David Wilmsen examines the differential use of
alternative grammatical constructions in Modern Standard Arabic from the perspec- 
tive of the writer’s native dialect. He argues that Arab writers’ frequency of use of 
the double object construction or the prepositional dative construction
in treating two pronominal object suffixes of ditransitive verbs is to some extent 
governed by their dialect. Thus North African writers tend toward the dative, and 
writers from the Arab east tend toward the double object construction. Assessments 
of the grammaticality of such constructions also tend to reflect the dialect of writers, 
despite the fact that both constructions are grammatical and have been used since the 
earliest Arabic writing. 
Negation in Moroccan Arabic:
Scope and Focus 

NIZHA CHATAR-MOUMNI 

Université Paris Descartes



STANDARD SENTENTIAL NEGATION in Moroccan Arabic (MA) is marked with both elements
ma- and -ʃ (or its variant -ʃi). Depending on the context, these elements can be split
in a discontinuous form or merged in a continuous form. For example, in direct assertions,
the discontinuous form surrounds a verbal predicate (1) or a quasi-verbal
predicate (2), whereas the continuous form precedes a nonverbal predicate (3):



(1) l-mra      ma-ja-t-ʃ             l-l-ʿərs
    the-woman  neg¹-come-3f.perf-q   to-the-wedding
    “The woman did not come to the wedding.”

(2) ma-ʿand-ha-ʃ  l-flus
    neg-at-her-q  the-money
    “She has no money.”

(3) l-mra      ma-ʃi  f-l-ʿərs
    the-woman  neg-q  in-the-wedding
    “The woman is not at the wedding.”

In marked utterances (for example, adversative utterances), the discontinuous
form can be used with a nonverbal predicate (5) and the continuous one with a verbal
predicate (4):



(4) ḥməd   kla-Ø?
    Ahmed  eat-3m.perf
    “Has Ahmed eaten?”

    ma-ʃi  kla-Ø        ʿammar-Ø-ha!
    neg-q  eat-3m.perf  fill.up-3m.perf-her
    “He has not eaten, he gorged!”

(5) toomobil(t)-ḥməd  jdida
    car-Ahmed         new
    “Ahmed’s car is new.”

    ma-jdida-ʃ  ʿand-ha  ʿam-in
    neg-new-q   at-her   year-dual
    “It is not new, it is two years old.”

In MA, the first element (ma-) is required in all contexts, while the second element
(-ʃ) can, or must, be dropped in various contexts.² In this chapter I focus on issues
entailing the presence or the absence of -ʃ in the context of a verbal (or a quasi-verbal)
predicate. Relying on Muller (1984), I argue that MA sentential negation
results from the association between the neg(ator) ma- and a q(uantifier).³ I claim
that, in order to satisfy the “negative association,” ma- must be attached to an undefined
quantifier. The presence or the absence of the element -ʃ in MA is related to the
presence or the absence of the [+undefined] feature.

The chapter is organized as follows: The first section briefly reviews the “Negative
Cycle” (Jespersen 1917) in French and Arabic; both languages show obvious
similarities in the process of negation renewal. The second section reviews the major
“negative associations” in MA, and focuses particularly on the relationship between
adverbial phrases of duration and the element ʃi. The last section deals with MA
negation as a “scope unit,” that is, a unit that applies a structural control on a fragment
of the sentence (Nølke 1994, 120).

The "Negative Cycle" 

Negation is a major theme of research in the grammaticalization framework. Further, 
it seems that the term grammaticalization was used for the first time by Meillet 
(1912) to describe and explain, among others, the evolution of sentential negation 
from Latin to French. It is well known that negation evolves in cycles. Jespersen
(1917) described the process of syntactic change of negation as a grammaticalization
pattern that Dahl (1979) later named “Jespersen’s Cycle” or the “Negative Cycle” (Van
der Auwera 2010).

The renewal process of negation in Arabic and French is quite similar. French
sentential negation stems from the preverbal Latin negation non:

(6) Egeo, si non est (Cato) 

“If I miss something, I pass.”

The Latin non, phonetically reduced and unstressed, evolved in Old French
into ne and combined with nouns denoting the smallest possible quantity in a given field
of experience, such as pas “step,” mie “crumb,” goutte “drop,” and point “stitch”:

(7) Quel part qu’il alt, ne poet mie chair (Chanson de Roland 2034).

“Wherever he goes, he cannot fall a crumb.” 

These nouns were selected according to the semantic class of the verb and according
to the denoted event (pas “step” in the context of negated verbs of motion, goutte
“drop” with negated verbs for “to drink,” and so on). They were gradually emptied of
their lexical meaning and acquired a grammatical one by contamination with ne. The
possibilities were reduced one by one in favor of point in the formal register (8) and
pas in the informal register (9). Currently, in the colloquial register, pas can be used
alone, without the preverbal ne (10), in a third stage of the “Negative Cycle”:

(8) Il ne mange point      (9) Il ne mange pas      (10) Il mange pas
“He does not eat.”         “He does not eat.”       “He does not eat.”



The MA negator ma- probably derives from Classical Arabic (CA), which marks
sentential negation with a single unit: lā, lam, lan, mā, or the negative copula laysa.
As for the element -ʃ (or its variant -ʃi), it most likely derives from the CA ʃayʾan “a
thing,” that is, the undefined noun ʃayʾ marked with the accusative as in (11) and
(12) below. In these Quranic examples, ʃayʾan, coupled with the negation lā, means
respectively “anything” (nominal) and “at all” (adverbial):⁴



(11) wa   lā   t-uʃrik-u             bi-hi     ʃayʾ-an
     and  neg  2.imp-associate-m.pl  with-Him  thing-acc
     Lit.: and not associate with Him a thing.
     “And join none with Him.” (Qur’an 4:36)

(12) Allāha  lā   ya-ẓlim-u                  n-nās-a         ʃayʾ-an
     God     neg  3m.imp-deal.unjustly.with  the-people-acc  thing-acc
     Lit.: God not deal unjustly with the people in/on a thing
     “God does not deal unjustly with the people at all.” (Qur’an 10:44)



According to Lucas and Lash (2010), ʃayʾan “is found predominantly in the
context of negation already in CA. In the Qur’an, for example, which consists of
approximately 80,000 words, ʃayʾan occurs 77 times. Of these, fully 63 (81.8 percent)
occur in the scope of negation.” The high frequency of ʃayʾan “a thing” in the scope
of negation gradually made it sensitive to negation and led it to become a negative
polarity item (NPI).

Hence, negation has been reinforced through a minimizer in French, that is, an
item denoting the smallest quantity in a field of experience (pas “step,” goutte
“drop,” etc.), and through a term denoting a vague, undefined quantity (ʃayʾan
“a thing”) in Arabic. On contact with negation, this smallest or vague quantity is
reduced to a zero quantity. Michèle Fruyt (2008, 2) rightly points out that: “L’histoire
de ces termes résulte du raisonnement selon lequel, s’il y a absence d’une entité
considérée comme infiniment petite dans un certain domaine d’expérience et même
absence de la plus petite entité connue et concevable, il y a nécessairement absence
de toute entité et donc il y a ce que l’on pourrait appeler, selon le modèle mathématique,
‘l’ensemble vide’ ou bien ‘l’absence absolue.’ L’emploi linguistique de la
négation correspond ici à la dénotation d’une absence, puisque la négation porte sur
une entité et non sur un procès.”⁵

Negation is closely related to quantification. That may be why, in the renewal
process, Arabic and French negation have been reinforced through a unit denoting
quantification. About French, Muller (1984, 94) emphasizes that “Il est bien connu
que la négation implique une vision totale du domaine de quantification; pour dire:
Il y a quelqu’un dans l’assistance qui est chauve, il n’est pas nécessaire de voir tout
le monde. Cela est nécessaire pour pouvoir dire: Il n’y a personne dans l’assistance
qui soit chauve. Ce pourrait être l’origine de la présence de quantifieurs en ancien
français comme pas, mie, goutte, brin, point, sur lesquels porte la négation pour
signifier que l’ensemble du domaine a été pris en considération.”⁶

Accordingly, for Muller, sentential negation results from the association between
negation and a quantifier. In MA, as I argue in the next two sections, sentential
negation results from the association between the neg(ator) ma- and an undefined
q(uantifier).

Negative Associations in Moroccan Arabic 

The standard “negative association” links ma- to the general and undetermined quantifier
-ʃ (13), the reduced form of ʃay. In MA, the full form is still in use, most often
to mark an emphatic negation (14). We can compare it with the French point, the
stronger form of pas (cf. 8 and 9 above), used in formal register but also in order to
mark an energetic negation.



(13) ma-ja-Ø-ʃ            ʿand-na  lyum
     neg-come-3m.perf-q   at-us    today
     “He did not come home today.”

(14) ma-ja-Ø-ʃay          ʿand-na  lyum!
     neg-come-3m.perf-q   at-us    today
     “He did not (at all) come home today!”




To cover the different domains of experience, ma- attracted into its scope other
quantifiers, all denoting an undefined quantity and selected according to the semantic
class of the verb and the denoted event. For example, for the feature [+human], ma-
is associated with the undefined quantifier ḥədd, stemming from CA ʾaḥad “one,” the
smallest numerical quantifier. Marçais (1935, 399) rightly pointed out that “Aucun
mot n’est plus apte à exprimer la valeur indéfinie que le mot qui dénote l’unité: la
notion un exemplaire pris entre plusieurs est en effet très proche parente de la notion
un exemplaire non identifié”:⁷






(15) ma-ʃaf-Ø-ni          ḥədd  lyum
     neg-see-3m.perf-me   one   today
     “Nobody saw me today.”

(16) ma-ʃaf-t             ḥədd  lyum
     neg-see-1.perf       one   today
     “I saw nobody today.”

For the feature [-human], MA associates ma- with walu “anything,” perhaps stemming
from CA wa-law “and if,” “even if,” denoting the irrealis, the absence:

(17) ma-kla-Ø          walu
     neg-eat-3m.perf   anything
     “He ate nothing.”
To cover the feature [+temporal], ma- is associated with ʿammar,⁸ stemming from
a word meaning “lifespan,” in order to mean “never.” We find a similar expression
in French: the NPI de ma vie “in my life” (18), which cannot coexist with pas (18’):

(18) De ma vie, je n’ai vu pareille chose!

“I have never (in my life) seen such a thing!”

(18’) *De ma vie, je n’ai pas vu pareille chose!

Note the agreement between the first-person possessive determiner ma “my” and the
first-person subject pronoun je “I.”

The noun ʿammar would stem from the grammaticalization of an iḍāfa construction
(Hoyt 2005), linking in a possessive relation the noun ʿammar “lifespan” and a
nominal, either a pronominal (19) or a noun (20), in an agreement relation with the
subject (Benmamoun 2006, 144):



(19) ʿammar-ha      ma-ja-t            l-d-dar
     never-her      neg-come-3f.perf   at-the-home
     “She never came home.”

(20) ʿammar-maryam  ma-ja-t            l-d-dar
     never-Meryem   neg-come-3f.perf   at-the-home
     “Meryem never came home.”



The similarity between the French de ma vie “in my life” and the MA ʿammar(-ni),
literally “lifespan(-me),” is interesting. Both, at different degrees of grammaticalization,
seem to operate as adverbial phrases of duration in a topic position. We can
compare them to MA adverbial phrases of duration such as hadi ʃahrin “for two months”
(21) or talt ʃhur “three months” (22):



(21) hadi   ʃahr-in     ma-ja-Ø            l-d-dar
     since  month-dual  neg-come-3m.perf   at-the-home
     “For two months, he has not come home.”

(22) talt   ʃhur        ma-ja-Ø            l-d-dar
     three  month.pl    neg-come-3m.perf   at-the-home
     “For three months, he has not come home.”



In negative contexts, these adverbials favor the sentence-initial position. In this
position, they frequently entail the absence of -ʃ, while they necessarily require the
presence of -ʃ in sentence-final position:

(23) ma-ja-Ø-ʃ            l-d-dar      hadi   ʃahr-in
     neg-come-3m.perf-q   at-the-home  since  month-dual
     “He has not come home for two months.”



In the sentence-initial position, these adverbials are highlighted through a kind
of cleft structure, overtly marked in (24). The presence of the relative pronoun lli,
which carries the feature [+defined], entails the presence of the [-defined] element -ʃ in
order to realize the “negative association”:⁹
(24) hadi   ʃahr-in     lli   ma-ja-Ø-ʃ            l-d-dar
     since  month-dual  that  neg-come-3m.perf-q   at-the-home
     “For two months, he has not come home.”

I suppose that the MA ʿammar (+nominal) derives from an adverbial phrase of
duration regularly highlighted in the sentence-initial position in negative contexts
until its crystallization in this position. The noun ʿammar denotes a vague duration
(“lifespan”) which, on contact with ma-, is reduced to a zero quantity.

Benmamoun (2006) rightly points out that ʿammar and the MA unit baqi “still,”
“not yet” share some properties. Both necessarily occur at the sentence edge:

(25) ʿammr-u    ma-ja-Ø            l-d-dar
     never-him  neg-come-3m.perf   at-the-home
     “He never came home.”

(26) baqi       ma-ja-Ø            l-d-dar
     yet        neg-come-3m.perf   at-the-home
     “He has not come home yet.”

Both can be preceded by the lexical subject: 

(27) ḥməd   ʿammr-u    ma-ja-Ø            l-d-dar
     Ahmed  never-him  neg-come-3m.perf   at-the-home
     “Ahmed never came home.”

(28) ḥməd   baqi       ma-ja-Ø            l-d-dar
     Ahmed  yet        neg-come-3m.perf   at-the-home
     “Ahmed has not come home yet.”

Both can carry agreement with the subject: 

(29) ʿammar-ha  ma-ja-t            l-d-dar
     never-her  neg-come-3f.perf   at-the-home
     “She never came home.”

(30) baqa       ma-ja-t            l-d-dar
     yet        neg-come-3f.perf   at-the-home
     “She has not come home yet.”

Nonetheless, unlike ʿammar, baqi is not an NPI in the traditional sense of the
term, inasmuch as it can occur both in negative contexts (31) and in nonnegative
contexts (32):

(31) baqi   ma-i-ji           l-d-dar
     yet    neg-3m.imp-come   at-the-home
     “He does not come home yet.”

(32) baqi   i-ji              l-d-dar
     still  3m.imp-come       at-the-home
     “He still comes home.”
In nonnegative contexts, baqi occurs only with the imperfective. Of course, an
example like (33) is not possible: a completed event cannot continue to go on:

(33) *baqi  ja-Ø           l-d-dar
     still  come-3m.perf   at-the-home

Moreover, and on this point I do not agree with Benmamoun (2006, 143n6), for
whom “like other NPIs it [baqi] is in complementary distribution with š,” baqi is
compatible with the postverbal element -ʃ, as in

(34) baqi  ma-ja-Ø-ʃ            l-d-dar
     yet   neg-come-3m.perf-q   at-the-home
     “He has not come home yet.”

Furthermore, the presence or the absence of -ʃ can give rise to two different
values:

(35) baqi  ma-i-ji             l-d-dar
     yet   neg-3m.imp-come     at-the-home
     “He does not come home yet.” (He is still at work)

(36) baqi  ma-i-ji-ʃ           l-d-dar
     yet   neg-3m.imp-come-q   at-the-home
     “He does not come home yet.” (Again)

In (35) negation follows from the association between ma- and baqi. This “negative
association” has scope over the predicative relation. This gives rise to a durative
value: the denoted event lasts. In (36) negation follows from the association between
ma- and the undefined quantifier -ʃ. Negation also has scope over the predicative
relation, but baqi is in the scope of negation. That gives rise to an iterative value: the
denoted event repeats, reoccurs. Note that the iterative reading is easier to check with
the unit ka- because, among other values, ka- explicitly marks repetition or
reiteration:

(37) baqi  ma-ka-i-ji-ʃ           l-d-dar
     yet   neg-pr-3m.imp-come-q   at-the-home
     “Again, he does not come home.”

Note that -ʃ seems necessary with the unit ka-; example (38) is perhaps possible,
but deviant:

(38) ?baqi  ma-ka-i-ji           l-d-dar
     yet    neg-pr-3m.imp-come   at-the-home

We can compare the MA baqi with the French encore “still,” “yet,” which also
takes different interpretations according to the scope of negation. If encore is in the
scope of negation, its value is durative:

(39) Il n’est pas encore venu à la maison
“He has not come home yet.” (Still)
If negation is in the scope of encore, its value is iterative: 

(40) Il n’est encore pas venu à la maison 

“He has not come home yet.” (Again) 

To sum up, MA ma- has attracted into its scope different items that all denote an 
undetermined, undefined quantity, which is reduced to a zero quantity on contact with 
ma-: ʃ(ay), hadd, walu, ‘ammar. Being regularly under the scope of negation, these 
quantifiers became sensitive to negation. All are NPIs in complementary distribution. 

Moroccan Arabic Negation Is a Scope Unit 

The MA structure hatta + NP, frequently translated as “any” + NP, is regularly 
analyzed as an NPI in complementary distribution with the element -ʃ (Benmamoun 
1997, 2006; Ouhalla 2002) because of contexts such as (41a) and (41b). These 
examples and their glosses are from Benmamoun (1997, 269): 

(41a) ma-qrit hatta ktab 

neg-read.1S even book 

‘I didn’t read any book.’ 

(b) *ma-qrit-š hatta ktab 

neg-read.1S even book 

‘I didn’t read any book.’ 

MA negation is a “scope unit,” that is to say, a unit that applies a structural 
control on a fragment of the sentence (Nølke 1994, 20). This fragment can be the 
predicative relation or a single term of the sentence (Caubet 1983; Marçais 1935). 

For example, in nonnegative contexts, continuous (mass) nouns are quantified 
with the definite article (42), and discontinuous (count) nouns are bare (43). 10 For 
Marçais (1935, 395) the bare quantification represents “le degré zéro de la 
détermination [the degree zero of determination]”: 

(42) hmed ʃra-Ø l-hlib 

Ahmed buy-3m.perf the-milk 

“Ahmed bought milk.” 

(43) hmed ʃra-Ø ktab 

Ahmed buy-3m.perf book 

“Ahmed bought a book.” 

In negative contexts, if negation has scope over the predicative relation, the 
element -ʃ is not dropped in the context of a noun quantified with the definite article: 

(44) ma-‘and-u-ʃ l-flus 

neg-at-him-q the-money 

“He does not have money.” 

(45) ma-‘and-u-ʃ l-ktab 

neg-at-him-q the-book 

“He does not have the book.” 
But it is dropped in the context of a bare noun: 

(46) ma-‘and-u flus 

neg-at-him money 

“He does not have any money.” 

(47) ma-‘and-u ktab 

neg-at-him book 

“He does not have a book.” 

If negation has scope over a single term of the predicative relation, -ʃ is not 
dropped, even if it is in the context of an indefinite noun: 

(48) ma-‘and-u-ʃ flus ‘and-u l-mlayar 

neg-at-him-q money at-him the-billion 

“He doesn’t have money, he has billions.” 

(49) ma-ʃra-Ø-ʃ ktab ʃra-Ø majalla 

neg-buy-3m.perf-q book buy-3m.perf magazine 

“He did not buy a book, he bought a magazine.” 

In these two contrastive contexts, the nouns flus “money” and ktab “book,” 
although they are not marked with the definite article, are not undefined. Quite the 
reverse: they are specified because they are focalized through intonation (or a 
discernible break). Focalization is also a process of definiteness that allows a speaker 
to specify the focus of his or her utterance. Identifying the focus is necessary to the 
interpretation process (Nølke 1994, 94). 

In (50), the quantifier wahad “one” is associated with ma- to deny the predicative 
relation. In this context, wahad means “an unspecified X,” hence -ʃ is not necessary: 

(50) ‘rad-na-hum kull-hum l-l-‘ars u wahad ma-ja-Ø 

invite-1.pl.perf-them all-them at-the-wedding and one neg-come-3m.perf 

“We invited them all to the wedding and nobody came.” 

In (51) wahad no longer operates as the second element of the negative 
association, that is, as an undefined quantifier, but means “a specified X,” “a defined 
X”; -ʃ is then necessary to fulfill the “negative association.” In (51) wahad, marked 
through intonation, is the focus of the relation: 

(51) ‘rad-na-hum kull-hum l-l-‘ars u wahad ma-ja-Ø-ʃ 

invite-1.pl.perf-them all-them at-the-wedding and one neg-come-3m.perf-q 

“We invited them all to the wedding and (only) one did not come.” 

In MA, as in French, the speaker/enunciator can resort to another process of 
focalization: the focus markers that Nølke (1983) calls adverbes paradigmatisants 
“paradigmatic adverbs”: “Un adverbe paradigmatisant introduit en tant que 
présupposé un paradigme d’éléments semblables à l’élément auquel il est attaché dans 
la phrase actuelle” (Nølke 1983, 19). 11 
The MA units gir “only” and hatta “even,” “until,” particularly in the context of 
a noun, operate as paradigmatic adverbs, that is, as focus markers. 12 The structure 
hatta + NP can occur in nonnegative contexts (52 and 53): 

(52) hatta l-mra ja-t l-l-‘ars 

even the-woman come-3f.perf at-the-wedding 

“Even the woman came to the wedding.” 

(53) hatta mra ja-t l-l-‘ars 

even woman come-3f.perf at-the-wedding 

“Even a woman came to the wedding.” 

and in negative contexts: 

(54) hatta mra ma-ja-t l-l-‘ars 

even q. woman neg-come-3f.perf at-the-wedding 

Lit.: even a woman has not come to the wedding 
“No woman came to the wedding.” 

(55) hatta ʃi-mra ma-ja-t l-l-‘ars 

even q-woman neg-come-3f.perf at-the-wedding 

Lit.: even some woman has not come to the wedding 
“No woman came to the wedding.” 

The noun mra gets an undefined quantification through the “degree zero of 
determination” (54) or through the indefinite quantifier ʃi (55). The “negative 
association” is fulfilled in this way. In these two examples the negation has scope over 
the predicative relation: The word hatta marks mra as the focus of the relation and 
introduces the presupposition that no member of the paradigm (the set of similar 
elements), not even “an unspecified woman,” validated the predicative relation, that 
is, came to the wedding. 

(56) hatta l-mra ma-ja-t-ʃ l-l-‘ars 

even the-woman neg-come-3f.perf-q at-the-wedding 

“Even the woman has not come to the wedding.” 

(57) hatta mra 13 ma-ja-t-ʃ l-l-‘ars 

even woman neg-come-3f.perf-q at-the-wedding 

“Not even one woman has come to the wedding.” 

The noun mra gets a defined quantification through the definite article in (56) 
or through a distinct intonation in (57); the presence of -ʃ is thus necessary to realize 
the “negative association.” The word hatta marks mra as the focus of the negation 
and introduces the presupposition that none of the members of the paradigm came, 
not even “one specified woman.” 

Consider, finally, examples (58) and (59). In these utterances the structure 
hatta + NP is no longer a verbal argument but an adjunct; -ʃ is dropped in (58), but it is 
present in (59): 







(58) hatta nhar ma-ja-Ø l-d-dar 

even day neg-come-3m.perf at-the-home 

Lit.: even a day he has not come home. 

“There is not a day he has come home.” 

(59) hatta nhar ma-ja-Ø-ʃ l-d-dar 

even day neg-come-3m.perf-q at-the-home 

Lit.: even a day he has not come home. 

“There is no day he has not come home.” 

In (58) negation has scope over the predicative relation. This relation is not 
validated for any member of the paradigm of nhar “day,” not even for “any day.” The 
“degree zero of determination” takes on an undefined value. Because of the “negative 
association” linking ma- and an undefined quantification, hatta nhar, even though it is 
an adjunct, belongs to the argument structure of the verbal predicate and is thus in 
the scope of negation. 14 We can further move hatta nhar to the right of the sentence 
(an argument slot) while preserving the same meaning: 

(60a) ma-ja hatta nhar l-d-dar 

“There is not a day he has come home.” 

(b) ma-ja l-d-dar hatta nhar 

“There is not a day he has come home.” 

This is not possible in example (59). The NP hatta nhar is necessarily in 
sentence-initial position, out of the scope of negation. 

Example (59) is not a direct assertion but an adversative assertion, linked to a 
previous context. Thus, we must consider it in a polyphonic context (Ducrot 1984). 
In pragmatics we usually distinguish between descriptive negation, the affirmation of 
a negative content presented by the speaker as his or her own, and polemic negation, 
the refutation of a content expressed previously by another speaker. In (59) the 
utterance “there is no day he has not come home” is an act of refutation that denies 
the positive utterance of a previous speaker who has said “there is a day he has not 
come home.” The difference between the two interpretations results from the presence 
or the absence of the postverbal element -ʃ. 

Conclusion 

In this chapter I have argued that MA sentential negation results from the association 
of the negator ma- and an undefined quantifier. If this association is satisfied in 
another way, the element -ʃ is dropped; if not, -ʃ is necessary to realize the “negative 
association.” 
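
The generalization can be restated as a small decision rule. The sketch below is only an illustration of the summary above, not part of the analysis itself: the class name, the function name, and the boolean encoding of "an undefined quantifier already completes the association with ma-" are assumptions introduced here, and the labels refer to examples (44), (46), (48), (50), and (51).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegatedClause:
    """A negated clause, reduced to the one property the rule depends on."""
    label: str
    # Hypothetical encoding: True when some undefined quantifier (a bare noun,
    # unspecified wahad, ...) already completes the association with ma-.
    undefined_quantifier_in_scope: bool


def takes_postverbal_sh(clause: NegatedClause) -> bool:
    """-ʃ surfaces only when nothing else completes the negative association."""
    return not clause.undefined_quantifier_in_scope


EXAMPLES = [
    NegatedClause("(44) definite noun l-flus", False),
    NegatedClause("(46) bare noun flus", True),
    NegatedClause("(48) focalized bare noun flus", False),
    NegatedClause("(50) unspecified wahad", True),
    NegatedClause("(51) specified wahad", False),
]

for clause in EXAMPLES:
    form = "ma-...-ʃ" if takes_postverbal_sh(clause) else "ma-..."
    print(f"{clause.label}: {form}")
```

The rule deliberately says nothing about how a noun comes to count as defined or undefined (article, intonation, focus marking); that information has to be supplied by the analyst for each example.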

MA sentential negation, as in many languages, is an operator, that is, a unit that 
applies a structural control on a fragment of the sentence (Nølke 1994, 120). The 
term in the scope of negation is the focus of the enunciation. The interpretation of a 
negative utterance depends on the scope’s perspectives. 



It would be interesting to consider the contexts licensing the presence or the 
absence of -ʃ in nonverbal sentences, and to propose a unified analysis of sentential 
negation in verbal and nonverbal sentences. 

ACKNOWLEDGMENTS 

The author would like to thank the audience at the Georgetown University Round Table on Languages 
and Linguistics 2010 for valuable comments and discussions, which greatly helped to improve the content 
of this chapter. 

NOTES 

1. Abbreviations: acc, accusative; f, feminine; imp, imperfective; indef, indefinite; lit, literally; m, 
masculine; neg, negator; perf, perfective; pl, plural; pr, present; q, quantifier; 1, 2, 3, first, second, 
third person. 

2. I refer the reader to Benmamoun 1997, 2006; Brustad 2000; Caubet and Chaker 1996; 
Chatar-Moumni 2008; Harrell 1962; Hoyt 2005; and Marçais 1935. 

3. See also Schapansky 2010. 

4. With respect to these examples, I refer the reader to Souag (2006, 21) and Suleiman (1999, 114 ff.). 

5. The story of these terms follows from the analysis that, if there is absence of an entity considered 
as infinitely small in a certain field of the experience and absence of even the smallest entity known 
and conceivable, there is necessarily the absence of any entity and therefore there is what might be 
called, as the mathematical model, the “empty set” or “the absolute absence.” The use of linguistic 
negation corresponds here to the denotation of an absence, because negation has scope over an entity 
and not over a process. 

6. It is well known that negation implies a total vision of the field of quantification: to say There’s 
someone in the audience who is bald, it is not necessary to see everyone, but it is necessary in order to 
say There is nobody in the audience who is bald. This could be at the origin of the presence in 
Old French of quantifiers such as step, crumb, drop, strand, stitch, which the negation focuses on to mean 
that the whole domain has been considered. 

7. No word is more likely to express an undefined value than the word indicating the unit: the concept a 
specimen taken from several is indeed very closely related to the concept a specimen unidentified. 

8. The word ‘ammar can be used in a nonnegative context: ‘ammar-ha ja-t l-d-dar? “Has she never 
come home?” But it is well known that interrogation is also a negative trigger for NPIs. 

9. In sentence-initial position, such APs have the whole clause in their scope. Hence, they take on the 
status of a syntactic nucleus. Benmamoun (2006) gives the NPI ‘ammar head status. 

10. Quantification is understood here as the operation of extension that consists in taking from a set of 
elements a certain quantity, determinate or indeterminate (Caubet 1983). 

11. A paradigmatic adverb introduces, as a presupposition, a paradigm of elements similar to the element 
to which it is attached in the current sentence. 

12. See also Ouhalla 2002, 18n2. 

13. In example (57), the noun mra “woman” is marked through a distinct intonation. 

14. For Muller (2003, 76), “la portée de la négation associée à un indéfini négatif couvre obligatoirement 
l’indéfini et la construction argumentale du verbe dépendant” [the scope of negation associated with a 
negative indefinite necessarily covers the indefinite and the argument construction of the dependent 
verb]. 
This chapter investigates the syntax and semantics of Arabic universal quantification 
from three different perspectives. The first is the transformational approach, which 
considers the Arabic universal quantifier kull and its two different structures, which 
we call the unmarked “NP adj Q” and the marked “FQ,” as base-generated and treats the 
marked construction as a “floated” quantifier. The second approach considers the 
marked FQ construction an adjoined adverb and does not posit any transformational 
link between the marked FQ and the associated DP nominal constructions. The third, 
lexical-functional approach, proposed in this study, argues that the two quantified 
structures are semantically different and accordingly show different syntactic 
constituencies. It argues that no movement is involved in the marked FQ construction 
and that no “floating” is involved in deriving the marked and unmarked quantification 
constructions. This Lexical Functional Grammar (LFG) analysis treats the marked FQ 
as an instance of topicalization. The TOPIC function is identified by its anaphoric 
binding and coreference with the SUBJ function, which is represented by the 
pronominal clitic attached to the Q kull in the subject position, a requirement that 
satisfies the Extended Coherence Condition (ECC). The study argues that whenever 
such anaphoric binding between the TOPIC and the so-called floated Q kull is absent, 
the TOPIC is not identified and the ECC is violated, rendering these constructions 
ungrammatical. 

The Phenomenon 

In (1a), the Arabic quantifier (Q) kull “all” appears adjacent to the determiner phrase 
(DP) ’at-tulla:b “the students” and seems to be semantically composed with this DP. 
The Q kull may also appear nonadjacent to ’at-tulla:b “the students,” as in (1b). The 
Q kull in (1a) will be called in this chapter the unmarked NP-adjacent Quantifier, which 
is represented as NP adj Q. In (1b), we follow the standard practice of calling elements 
like kull-u-hum the so-called floating quantifier, or the marked FQ: 



(1a) kull ’at-tulla:b katab-u: ’ad-dars-a [unmarked NP adj Q] 

all the-students-3PL.MASC wrote the-lesson 

“All the students wrote the lesson.” 

(b) ’at-tulla:b katab-u: kull-u-hum ’ad-dars-a [marked FQ] 

the-students-3PL.MASC wrote all-3PL.MASC the-lesson 

“The students wrote all the lesson.” 

(c) ’at-tulla:b kull-u-hum katab-u: ’ad-dars-a 

the-students-3PL.MASC all-3PL.MASC wrote the-lesson 

“The students all wrote the lesson.” 

(d) * ’at-tulla:b katab-u: kull ’ad-dars-a [FQ] 

the-students-3PL.MASC wrote all the-lesson 

“The students wrote all the lesson.” 

(e) * ’at-tulla:b kull katab-u: ’ad-dars-a 

the-students-3PL.MASC all wrote the-lesson 

“The students all wrote the lesson.” 

Proposed Arguments 

The Arabic universal quantifier kull in example (1) shows an apparent mismatch 
between the position of the quantificational element and its interpretation. Various 
responses to this challenge have been proposed. 

In this chapter we discuss three major approaches to this quantification problem in 
Arabic, the third of which is our own proposal. The first approach proposes a 
transformational, derivation-based analysis of the marked FQ and unmarked NP adj Q 
constructions in Arabic. This approach has been used by Sportiche (1988), Miyagawa 
(1989), Shlonsky (1991), Merchant (1996), Bošković (2004), and many others. The 
second approach eschews this transformational treatment and treats the marked FQ 
construction as an adverbial element that does not directly quantify over the related 
nominal. This treatment has also been adopted by other scholars, such as Dowty and 
Brodie (1984), Bobaljik (1995), Doetjes (1997), and Brisson (2000). 

Our third approach proposes an analysis of this quantificational phenomenon in 
Arabic in the LFG framework as developed and used by Bresnan (2000), Falk (2001), 
and Dalrymple (2001). This LFG approach provides a new analysis that uses 
nonderivational LFG configurations. In this lexical approach to analyzing 
quantification in Arabic, we argue that sentences (1a) and (1b) present two different 
semantic and syntactic structures that involve two different, yet morphologically 
related, quantifiers: the unmarked NP adj Q kull and the marked FQ kull. This claim is 
supported theoretically and empirically in the LFG framework. Because these two 
structures, that is, NP adj Q and FQ, show different c-structures and f-structures, we 
believe that neither structure can be derived from the other. To this end the parallel 
architecture of LFG will help us to accurately describe, explain, and analyze this 
quantificational problem in terms of Arabic internal characteristics. Although our LFG 
analysis claims that the Arabic quantifier kull does not float, we will be using the 
term FQ for convenience. 

The third section presents the first two analyses of FQ, as previously developed 
from the transformational and adverbial perspectives. In the fourth section we 
discuss the semantics of the Arabic quantifier kull, its properties as either unmarked 
NP adj Q or marked FQ, and the semantic differences between them. The fifth 
section discusses the syntax of both NP adj Q and FQ constructions from an LFG 
perspective. We argue that the Arabic Q kull is an independent functional category 
Q and that its c-structure position is the head of QP. We claim that the Arabic FQ 
construction is neither a floated nor an adverbial version of the NP adj Q, contrary to 
the transformational and adverbial approaches presented in the third section. This 
chapter considers this marked construction a topicalized construction involving 
triggered inversion. Finally, the sixth section provides the conclusions of the study 
and suggestions for further research. 

Nonlexical Accounts of Arabic Universal Quantification 

In this section we present two major analyses of Arabic quantification. The first 
is a transformational, derivation-based analysis. The second 
treats the Arabic quantifier kull as an adverb. 

The Transformational Analysis of 
Arabic Universal Quantification 

Since Sportiche (1988) and Koopman and Sportiche (1991), researchers have assumed 
that subjects originate in a structural position lower than their observed position, 
namely the “VP-internal subject” position. Given this assumption, example (1a) must 
involve movement of the DP kull ’at-tulla:b “all the students,” as shown in (2): 

(2a) [IP [DP __ ] [I′ [VP [DP kull ’at-tulla:b] [V′ [V katab-u:] [NP ’ad-dars-a]]]]] 
(b) [IP [DP kull ’at-tulla:b]i [I′ [VP [DP ti ] [V′ [V katab-u:] [NP ’ad-dars-a]]]]] 

If the DP kull ’at-tulla:b can be moved in this manner in (2), then the DP ’at-tulla:b 
can also be moved alone, excluding the Q kull. This assumption, along with the 
VP-internal subject assumption in (2), results in so-called floating quantification 
through the derivation in (3), which represents (1b) above: 

(3a) [IP [DP __ ] [I′ [VP [DP kull ’at-tulla:b] [V′ [V katab-u:] [NP ’ad-dars-a]]]]] 
(b) [IP [DP ’at-tulla:b]i [I′ [VP [DP kull-u-hum ti ] [V′ [V katab-u:] [NP ’ad-dars-a]]]]] 

This derivational approach suggests that the so-called FQ structures and the 
NP adj Q structures are transformationally related to each other, as the representations 
in (2a) and (3a) show. This means that the underlying structures of (1a) and (1b) are 
identical. Like Sportiche (1988), we can say that the quantifier kull universally 
quantifies over the set represented by the DP ’at-tulla:b in both (1a) and (1b). 
Therefore, the Q kull composes semantically with ’at-tulla:b to provide the same 
meaning as in nonfloating constructions. 
Syntactically, the NP adj Q and the FQ are correspondingly related. This means 
that if the quantifier modification of DP is the same in (1a) and (1b), whether the Q 
kull-(hum) is adjacent to the DP or stranded from it, both constructions have the 
same underlying syntactic structure. The difference in the surface structure of (1a) 
and (1b) can be captured through the derivational mechanism in (4): 




This analysis relies upon the VP-internal Subject Hypothesis assumed by Sportiche 
(1988). The DP kull ’at-tulla:b “all the students” originates in the VP-internal 
thematic position of Subject, and the NP A-moves to SPEC IP to get Case, leaving the 
Q in situ. Although the Q kull is stranded from its NP ’at-tulla:b, the 
antecedent-anaphor relationship still holds, subject to Principle A. This is why the Q 
kull has a pronominal clitic that agrees with the original NP ’at-tulla:b. Therefore, the 
Q kull surfaces as kull-u-hum in order to satisfy this antecedent-anaphor relationship. 

This analysis captures the observation that was the major motivation for a 
transformational or derivational relation between (1a) and (1b): The Q kull modifies the 
DP and agrees with it because the [Q DP] is a single constituent at the D-structure. 
Because the Arabic FQ kull must agree with the DP it modifies, we can claim that 
the Q is a head that selects a DP as its complement and forms a QP. This is illustrated 
in (5) and (6): 




ON THE SYNTAX AND SEMANTICS OF ARABIC UNIVERSAL QUANTIFICATION 








Such a representation is also adopted by Shlonsky (1991), Merchant (1996), and 
Bošković (2004), who argue that in relation to the NP, the Q (kull in Arabic) is a head 
rather than an adjunct or specifier. Their convincing argument consists in their 
account of the internal structure of the QP in (5) and the mechanism of extraction as 
represented in (6). Their analyses involve the splitting apart of a DP by movement 
of some sub-DP constituent. 

Proponents of the transformational or derivational analyses of the so-called FQ 
stress four advantages of such an approach. First, the floating Q is both compatible 
with the VP-internal subject hypothesis (Koopman and Sportiche 1991) and provides 
evidence for it. Second, this approach seems to explain the observed semantic 
similarity between the so-called FQ and the NP adj Q. As shown above, FQs are 
underlyingly non-FQs. Any semantic differences between FQ and NP adj Q constructions 
therefore become a challenge to this transformational/derivational approach to 
universal quantification. Third, this transformational approach explains the agreement 
patterns that often arise with FQs in some languages (see Merchant 1996; Shlonsky 
1991; Bošković 2004; Benmamoun 1999). With a few exceptions, when a language such 
as Arabic shows agreement between a Q and its constituent-mate nominal DP, this same 
agreement relationship appears in floated contexts. This can be seen in the invariant 
agreement in the examples in (7): 

(7a) * ’at-tulla:b katab-u: kull ’ad-dars-a [FQ] 

the-students-3PL.MASC wrote all the-lesson 

“The students wrote all the lesson.” 

(b) * ’at-tulla:b kull katab-u: ’ad-dars-a 

the-students-3PL.MASC all wrote the-lesson 

“The students all wrote the lesson.” 

(c) *kull-u-hum ’at-tulla:b katab-u: ’ad-dars-a 

all-them the-students-3PL.MASC wrote the-lesson 

“The all students wrote the lesson.” 

Finally, the proponents of the transformational/derivational analysis of the 
so-called FQ claim that it explains the distribution of FQs, which appear in the original 
or intermediate positions of nominal/argument phrases. In the fifth section below, we 
show that this analysis does not appropriately account for such a phenomenon. 

The Adverbial Analysis of 
Arabic Universal Quantification 

The second nonlexical approach to the so-called FQ, an alternative to the 
transformational/derivational approach discussed in the previous section, is the 
adverbial analysis, which treats FQs as adverbs. This approach rejects the idea that 
floating or stranded quantification is nonfloating quantification that has been moved 
or changed through transformation. The adverbial analysis considers the unmarked 
NP adj Q and the marked FQ constructions different and posits a different derivation 
for each one. The proponents of the adverbial approach have founded their analyses of 
the marked FQs as adverbs on direct empirical arguments (cf. Bobaljik 1995; Brisson 
2000; Nakanishi 2003, 2004) and on the fact that the marked FQs occupy positions in 
which adverbs canonically surface. 
The adverbial analysis of marked FQ does not posit any transformational link 
between the FQ and the associated DP nominal constructions. The FQ is treated as 
an adjunct that can be placed somewhere in the verb phrase or in lower inflectional 
domains. The Q is considered an NP adjunct in this approach. When the Q (see 
1b-d) bears the agreement clitic, the DP and the Q cannot be argued to share the 
same syntactic derivation, as argued in the transformational approach. In such 
constructions, the Q along with its agreement clitic is considered a verb phrase 
adverb. 

In Arabic, the so-called FQ contains a pronominal clitic. This pronominal element 
is related semantically to the associated DP/NP. When the Arabic Q kull is floated, it 
requires a pronominal clitic, as shown in (8): 



(8a) ’al-’awla:d-u xaraj-u: kull-u-hum 

the-boys-NOM went out all-NOM-them 

“The boys went out all.” 

(b) ’al-’awla:d-u kull-u-hum xaraj-u: 

the-boys-NOM all-NOM-them went out 

“The boys all went out.” 

(c) * ’al-’awla:d-u xaraj-u: kull 

the-boys-NOM went out all 

“The boys went out all.” 

Clitic pronouns in the Arabic marked FQs (cf. 8a, b) always show agreement for 
person, number, gender, and case with their nominal associates. 

When the Q kull is in an unmarked position, no such pronominal is required or 
allowed, as shown in (9): 



(9a) kull ’al-’awla:d xaraj-u: 

all the-boys went out 

“All the boys went out.” 

(b) * kull-u-hum ’al-’awla:d-u xaraj-u: 

all-NOM-them the-boys-NOM went out 

“All the boys went out.” 

(c) * ’al-’awla:d kull xaraj-u: 

the-boys all went out-they 

“The boys all went out.” 

In Arabic, the marked FQ kull can carry the same set of pronominal clitics as 
nouns. This is illustrated in (10): 



(10) kull-u-na:      vs.   balad-u-na: 
     kull-u-kum      vs.   balad-u-kum 
     kull-u-kunna    vs.   balad-u-kunna 
     kull-u-hum      vs.   balad-u-hum 
     kull-u-hunna    vs.   balad-u-hunna 
The adverbial analysis of the marked and unmarked quantification constructions 
fails to explain the full agreement of the pronominal clitic on the Q kull with its 
antecedent DP/NP. This makes it a weak approach to accounting for the differences 
between the marked FQ and unmarked NP adj Q constructions in Arabic and other 
languages. This weakness will be highlighted further in the fourth and fifth sections. 

The Semantics of Marked FQ in Arabic 

The Arabic unmarked NP adj Q and marked FQ kull constructions can be logically 
represented by the universal quantifier ∀. The Arabic quantifier kull is polysemous. 
It can be translated into English as all, any, every, each, entire, entirely, and whole, 
as shown in (11): 

(11a) kull al-’awla:d xaraj-u: 

all the-boys went out 

“All the boys went out.” 

(b) kull ’at-ta‘a:m xilis [spoken Arabic] 

entire/whole food finished 

“The entire/whole food is finished.” 

(c) kull walad ’akala tufa:hat-an 

each boy ate apple 

“Each boy ate an apple.” 

(d) kull bint ixta:r-at kita:b-an 

every girl selected book 

“Every girl selected a book.” 

As the examples discussed thus far in this chapter show, we restrict the semantics of 
the Arabic universal Q kull to the English plural all. That is, the meaning of the 
Arabic Q kull is restricted to the plural meaning expressed in examples (1a, c). 



The Semantics of Kull in the 
NP ad jQ and FQ Constructions 

We limit our investigation of Arabic universal quantification in this chapter to the 
unmarked construction NP adj Q and the marked construction FQ as represented in (1a, 
b), repeated in (12): 

(12a) kull ’at-tulla:b katab-u: ’ad-dars-a [unmarked NP adj Q] 

all the-students-3PL.MASC wrote the-lesson 

“All the students wrote the lesson.” 

(b) ’at-tulla:b katab-u: kull-u-hum ’ad-dars-a [marked FQ] 

the-students-3PL.MASC wrote all-them the-lesson 

“The students wrote all the lesson.” 

Although we use the term FQ to refer to the marked Q kull in (12b), we propose 
that the unmarked NP adj Q and the marked FQ are two different constructions 
that are syntactically and semantically distinct. 
In the next section we provide three major semantic differences between the 
unmarked NP adj Q and the marked FQ kull constructions. These semantic differences 
concern the predication type, quantification assignment, and scope ambiguity that 
each of them takes or assigns. 

Predication Types 

When the Arabic Q kull is used with verbs that express either a distributive 
interpretation or a collective predication, the meaning changes according to the marked 
or unmarked position of the Q kull. Example (13) shows such differences in meaning: 

(13a) kull ’at-tulla:b ’axad-u: tufa:hat-an 

all the-students took-they apple-an 

“All the students took an apple.” 

(b) ’at-tulla:b-u ’axad-u: kull-u-hum tufa:hat-an 

the-students-NOM took-they all-NOM-them apple-an 

“The students all took an apple.” 

Example (13a) has both a distributive and a collective interpretation. On the 
distributive interpretation, if there is a group of ten students, then each one took an 
apple; thus the total number of apples is ten. Besides this distributive meaning of the 
Q kull, (13a) can also mean that the ten students together took only one apple; this 
is the collective interpretation. By contrast, (13b) has a collective reading only: if 
the group consists of ten students, then the ten students took only one apple. This 
semantic difference between the unmarked NP adj Q and the marked FQ can be further 
illustrated by example (14): 

(14a) kull ’at-tulla:b ’axad-u: tufa:hat-an wa-Omar ’axada tufa:hat-an 

all the-students took apple and-Omar took apple 

“All the students took an apple, and Omar took an apple.” 

(b) ?# ’at-tulla:b ’axad-u: kull-u-hum tufa:hat-an wa-Omar ’axada tufa:hat-an 

the-students took all-NOM-them apple and-Omar took apple 

“The students took all an apple, and Omar took an apple.” 

If Omar is an included member of the students, then the distributive interpreta- 
tion of (14a) is fine, because Omar, like other students, took an apple. Conversely, 
(14b) may sound odd if the distributive interpretation is applied with the marked FQ
kull-u-hum. If we assume that all the students collectively took an apple, it is
redundant and sounds odd to assert that Omar, who is included in this group of students,
also took an apple.
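
The two readings of (13a) discussed above can be written out in first-order notation. The following is only a sketch under our own assumptions: the predicate names and the symbol S for the group of students are illustrative and do not appear in the examples themselves.

```latex
% Distributive reading of (13a): each student took a (possibly different) apple.
\[ \forall x\,\bigl[\mathrm{student}(x) \rightarrow \exists y\,[\mathrm{apple}(y) \wedge \mathrm{took}(x, y)]\bigr] \]

% Collective reading, available for (13a) and the only reading of (13b):
% the group S of students took a single apple together.
\[ \exists y\,\bigl[\mathrm{apple}(y) \wedge \mathrm{took}(S, y)\bigr] \]
```

With ten students, the first formula is compatible with ten apples, whereas the second requires only one.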

Quantification Assignments 

The second major semantic difference between the unmarked NP adj Q and the marked
FQ kull is the type of quantification each construction assigns. Generally speaking,
both constructions take a plural noun and a plural verb, as illustrated in all the
examples discussed so far. Still, the unmarked NP adj Q construction ranges over sets,
whereas the marked FQ construction ranges over members of sets. As an Arabic
universal quantifier, the marked FQ kull must range over the whole set. This is to say
that the Q refers to each member of the set. This also applies to the collective
interpretation, which assumes that each member of the set is counted. In the case of the
unmarked Q kull, this quantification interpretation is absent. Examples (15) and (16)
illustrate this point. The difference between the unmarked NP adj Q and the marked FQ
constructions is logically represented in (15b) and (16b):

(15a) kull al-bana:t ʃaqra:w-a:t

all the-girls blond

“All the girls are blond.”

(b) $\forall x\,(Gx \rightarrow Bx)$

(16a) al-bana:t kull-u-hunna ʃaqra:w-a:t

the-girls-3PL.FEM all-3PL.FEM blond

“The girls are all blond.”

(b) $\forall x\,(x_{1 \ldots n} \text{ is a girl} \rightarrow Bx)$

As can be seen from (15) and (16), the marked FQ kull construction quantifies 
over individuals, whereas the unmarked NP adj Q construction quantifies over an 
empty set. 

Arabic Quantifier Scope 

The unmarked NP adj Q and marked FQ kull constructions result in scope ambiguities
when used with modality and negation structures. We agree with Dowty and Brodie 
(1984) that the available readings vary according to the quantifier. Example (17) 
expresses modality use: 

(17a) kull ’at-tulla:b yumkin yinjah-u:

all the-students can succeed

“All the students can/may succeed.”

(b) ’at-tulla:b-u yumkin kull-u-hum yinjah-u:

the-students may all succeed

“The students can/may all succeed.”

Two readings are available in (17a). First, the unmarked universal Q
kull can take scope over the modal yumkin “can/may”; that is, all the students can
succeed. Second, it can take narrow scope below the modal, and the
sentence means that it is possible that all the students succeed. As for (17b), there is only
one reading: the marked universal Q kull takes narrow scope below
the modal yumkin “can/may,” and the sentence means that it is possible that all the
students succeed. Borrowing Bobaljik’s (2003) description, the marked FQ kull is “frozen”
in its scope; that is, its scope is derived from its in situ position in this so-called
floated position.
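
The scope contrast just described can be made explicit with a possibility operator. This is a sketch in our own notation, not a representation taken from the examples: the diamond stands for the modal yumkin, and the predicate names are illustrative.

```latex
% Requires \usepackage{amssymb} for \Diamond.
% (17a), wide scope of kull over the modal: for each student, success is possible.
\[ \forall x\,[\mathrm{student}(x) \rightarrow \Diamond\,\mathrm{succeed}(x)] \]

% (17a) and (17b), narrow scope of kull below the modal:
% it is possible that every student succeeds.
\[ \Diamond\,\forall x\,[\mathrm{student}(x) \rightarrow \mathrm{succeed}(x)] \]
```

On this rendering, (17a) is compatible with both formulas, whereas (17b) is compatible only with the second.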

The unmarked NP adj Q in (18a) takes scope over negation; the only reading is
that no student succeeded. In (18b) the marked universal Q kull takes scope over
negation; thus the only reading for the sentence is that no student succeeded:



(18a) kull ’at-tulla:b lam yinjah-u:

all the-students did not succeed

“All the students did not succeed.”

(b) ’at-tulla:b kull-u-hum lam yinjah-u:

the-students all did not succeed

“The students did not all succeed.”

(c) ’at-tulla:b lam yinjah-u: kull-u-hum

the-students did not succeed all-them

“The students did not succeed all.”

The above-noted semantic differences between the unmarked and marked quan- 
tifiers discussed here show that the two structures are not the same. These semantic 
differences necessitate syntactic differences as well. In (15b) and (16b), we have 
shown that the unmarked and marked constructions of the Arabic universal quantifier 
kull reveal different logical representations. Accordingly, the claim of the transfor- 
mational and adverbial approaches that the two structures are syntactically equivalent 
and semantically identical is not supported. This buttresses our claim that these two 
quantification structures are not semantically identical; thus they are also syntacti- 
cally different. The fifth section illustrates this point further. 

An LFG Account of Arabic Universal Quantification 

The Topicalization Argument for
the Marked FQ Construction

This section develops our claim that the unmarked NP adj Q and the marked FQ
are two different constructions from the perspective of Lexical
Functional Grammar (LFG) (Bresnan 2000; Dalrymple 1993, 2001). Our main LFG
account of the marked FQ construction is that it is an instance of topicalization
(TOPIC). The pronominal clitic on the Arabic Q kull can be accounted for by the LFG
Extended Coherence Condition (ECC), as suggested by Dalrymple (2001, 185) and
Bresnan (2000, 62).
Zaenen (1985), Fassi-Fehri (1988), and Bresnan and Mchombo (1987) were the first
to argue that argument and nonargument functions are required to be integrated in
the f-structure if they bear an appropriate relation to a predicate (PRED). Following
on this, Dalrymple (2001, 185) formulates the ECC as follows:

(19) Extended Coherence Condition [ECC]: FOCUS and TOPIC must be 
linked to the semantic predicate argument structure of the sentence in 
which they occur, either by functionally or by anaphorically binding an 
argument. 

As for the marked FQ kull construction, our analysis reveals that the pronominal
clitic attached to kull is anaphorically bound by the TOPIC, which is the inverted
antecedent as shown in (1b) and (1c), which are repeated here for convenience
as (20a, b):

(20a) [’at-tulla:b-TOPIC]_i, katab-u: [kull-u-hum-SUBJ]_i ’ad-dars-a

the-students-3PL.MASC wrote all-3PL.MASC the-lesson

“The students wrote all the lesson.”

(b) [’at-tulla:b-TOPIC]_i, [kull-u-hum-SUBJ]_i katab-u: ’ad-dars-a

the-students-3PL.MASC all-3PL.MASC wrote the-lesson

“The students all wrote the lesson.”

As can be seen from (20), the Arabic Q kull, with its pronominal clitic that agrees
in number, gender, person, and case with its antecedent NP ’at-tulla:b “students,”
functions as the SUBJ of the main verb. According to the ECC described in
(19), the cliticized pronoun in the Q kull is anaphorically bound by the topicalized
NP ’at-tulla:b; therefore, the pronominal clitic takes its identification from the TOPIC
via the coindexation shown in (20). The TOPIC in LFG is considered bound if it is
functionally identified with, or anaphorically binds, a bound function. This is why
the marked FQ kull must have an obligatory pronominal clitic that fully agrees with
its antecedent.

There is much support for our LFG analysis of the marked FQ and its associate
as a topicalized construction. One piece of evidence is that a TOPIC must be definite
and sentence initial, as illustrated in (21):

(21) * tulla:b katab-u: kull-u-hum ’ad-dars-a

students-3PL.MASC wrote all-3PL.MASC the-lesson

“Students wrote all the lesson.”

Another piece of evidence that supports our analysis of the marked FQ kull and
its NP antecedent as a topicalized construction is that TOPICs cannot be questioned or
focused. This means that the TOPIC in (20) is not the subject of the main verb in the 
sentence; the real subject of the verb is the pronominal clitic attached to the Q kull. 
This can be further shown by the fact that subjects can be questioned as shown in 
(22), whereas topics cannot be questioned, as illustrated in (23): 



(22a) [’at-tulla:b-SUBJ] katab-u: ’ad-dars-a

the-students-3PL.MASC wrote the-lesson

“The students wrote the lesson.”

(b) [man-FOCUS] ta-‘taqid katab-u: ’ad-dars-a

who you-think wrote the-lesson

“Who do you think wrote the lesson?”

(23a) [’at-tulla:b-TOPIC] katab-u: [kull-u-hum-SUBJ] ’ad-dars-a

the-students-3PL.MASC wrote all-3PL.MASC the-lesson

“The students all wrote the lesson.”

(b) *[man-TOPIC/FOCUS] ta-‘taqid katab-u: kull-u-hum ’ad-dars-a

who you-think wrote all-3PL.MASC the-lesson

“Who do you think all wrote the lesson?”

There is a function conflict in (23b); that is why the sentence is ungrammatical.
One entity cannot be a TOPIC and a FOCUS at the same time. A TOPIC is considered old
information, whereas a FOCUS is new information. Such evidence leads us to say that
the NP ’at-tulla:b in (20) does not function as the SUBJ of the sentence. Because the
NP ’at-tulla:b and the marked FQ kull-u-hum in (20) refer to the same entity through
coindexation, we can say that the NP ’at-tulla:b functions as the TOPIC, not the SUBJ,
of the sentence.

The Syntactic Differences between the
Unmarked NP adj Q and the Marked FQ

This section highlights the syntactic and semantic differences between the unmarked
NP adj Q and the marked FQ constructions. Again, we believe that the two constructions are
semantically different, and thus they are syntactically distinct. This difference can
be illustrated within the LFG framework as follows.

The LFG Representation of the Unmarked NP adj Q Construction

The representations in (24) show the lexical, c-structural, and f-structural
configurations of the unmarked NP adj Q constructions according to the LFG framework:

(24a) The Lexical Entry of NP adj Q:

kull Q: PRED ‘kull ⟨(↑ OBJ)⟩’

(↑ OBJ NUM) =c PL

(↑ OBJ DEF) =c +

(b) The C-Structure Entry of NP adj Q:

[Figure: c-structure tree; only the terminal labels kull-u-hum and fi-l-bayt-i survive]

(c) The F-Structure Entry of NP adj Q:

PRED ‘kull ⟨(↑ OBJ)⟩’

OBJ [DEF +, PRED ‘ta:lib’, NUM PL, GEND MASC]

According to the LFG convention, only verbs and prepositions subcategorize for
(↑ OBJ). This is not the case in (24c), where the Q kull takes an OBJ. In this chapter
we have presented
an argument that considers the Arabic universal Q kull the head of QP (cf. examples
5 and 6). Because heads take complements, this should be expressed in the f-structure
in the LFG representation. This view is also supported by Fassi-Fehri (1988), whose
analysis of Arabic quantifiers allows them to take complement NPs as OBJs.

The LFG Representation of the Marked FQ Construction 

The representations in (25) show the lexical, c-structural, and f-structural configura- 
tions of the marked FQ constructions according to the LFG framework: 

(25a) The Lexical Entry of Marked FQ:

kull FQ: PRED ‘kull ⟨(↑ OBJ)⟩’

(↑ OBJ PRED) = ‘PRO’

(b) The C-Structure Entry of Marked FQ:

(c) The F-Structure Entry of Marked FQ:

As can be seen in (25c), the TOPIC, ’at-tulla:b, and the SUBJ, the Q kull-u-hum,
are anaphorically bound. This binding relationship between the SUBJ and its TOPIC is
indicated by coindexation in the f-structure. Again, this LFG analysis overcomes all
the disadvantages that accompany both the transformational/derivational and adverbial
analyses discussed in the third section. The LFG analysis shows that the unmarked
NP adj Q and the marked FQ constructions are two different structures; therefore, there is
no “floating” involved as claimed by the traditional nonlexical approaches discussed
in the third section.
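
The binding relation described for (25c) can be sketched with f-structures encoded as nested Python dictionaries. This is a minimal illustration under our own assumptions, not an implementation of LFG: the feature names INDEX, PERS, and CASE, the helper functions, and the simplified check are all hypothetical. The encoding of the clitic as the OBJ of kull with PRED ‘PRO’ follows the lexical entry in (25a).

```python
AGREEMENT = ("PERS", "NUM", "GEND", "CASE")  # assumed agreement features


def agrees(a, b):
    """True if two f-structures carry identical agreement features."""
    return all(a.get(f) == b.get(f) for f in AGREEMENT)


def satisfies_ecc(f):
    """Simplified check: a TOPIC must anaphorically bind a pronominal argument."""
    topic = f.get("TOPIC")
    if topic is None:
        return True  # no TOPIC, so nothing needs to be licensed
    for gf in ("SUBJ", "OBJ"):
        arg = f.get(gf, {})
        pro = arg.get("OBJ", {})  # the clitic is the OBJ of kull, as in (25a)
        if (pro.get("PRED") == "PRO"
                and pro.get("INDEX") == topic.get("INDEX")
                and agrees(pro, topic)):
            return True
    return False


students = {"PRED": "ta:lib", "DEF": "+", "PERS": 3, "NUM": "PL",
            "GEND": "MASC", "CASE": "NOM", "INDEX": "i"}
clitic = {"PRED": "PRO", "PERS": 3, "NUM": "PL",
          "GEND": "MASC", "CASE": "NOM", "INDEX": "i"}

# Rough f-structure for (20a): TOPIC and the clitic on kull share index i.
marked_fq = {
    "PRED": "katab<SUBJ,OBJ>",
    "TOPIC": students,
    "SUBJ": {"PRED": "kull<OBJ>", "OBJ": clitic},
    "OBJ": {"PRED": "dars", "DEF": "+"},
}

# The same clause with kull lacking its pronominal clitic.
no_clitic = dict(marked_fq, SUBJ={"PRED": "kull"})

print(satisfies_ecc(marked_fq), satisfies_ecc(no_clitic))
```

In this toy encoding the TOPIC is licensed only when kull carries a coindexed, fully agreeing pronominal clitic, which mirrors the requirement stated in (19).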

Conclusion 

This chapter has investigated the syntax and semantics of Arabic universal
quantification from three different perspectives. The first one is the nonlexical
transformational approach that considers the Arabic universal Q kull and its two different
structures, which we call the unmarked “NP adj Q” and the marked “FQ,” as
base-generated and treats the marked construction as a “floated” quantifier. The second
approach considers the marked FQ construction an adjoined adverb and does not
posit any transformational link between the marked FQ and the associated DP nominal
constructions. It considers the Q an NP adjunct. Still, it fails to explain the
full agreement of the pronominal clitic on the Q kull with its antecedent
DP/NP.

The third approach, the lexical-functional one proposed in this study, takes the 
two quantified structures to be semantically different; accordingly, they show differ- 
ent syntactic constituencies. It shows that there is no movement involved in the 
derivation of the marked FQ construction. It also argues that there is no “floating” 
involved in deriving the marked and unmarked quantification constructions discussed 
in the current study. 

Finally, the LFG analysis adopted in this study treats the marked or so-called
FQ as an instance of topicalization. In this lexical analysis of the marked FQ, the
TOPIC function is identified by its anaphoric binding and coreference with the SUBJ
function represented by the pronominal clitic attached to the Q kull in the subject
position, a requirement that satisfies the ECC in the LFG framework. Whenever such
anaphoric binding is absent, that is, whenever the so-called floated Q kull does not
contain a pronominal clitic that agrees with the TOPIC in gender, number, person, and
case, the TOPIC will not be identified, and thus the ECC will be violated. If this occurs, such
constructions will be judged ungrammatical.

We could say that some findings of this study are truly universal phenomena.
The links discussed between the marked FQ type and the movement type, and
between the semantic type and the FQ type, are quite robust and should be
applicable in studies of movement and of the interaction of syntax and semantics more
broadly. The current debate concerns the proper analysis of so-called floating
quantification. That is, one can analyze quantified constituency splits as derived either
through a movement transformation, through adverbial adjunction, or through some
nontransformational mechanism, as shown by the adoption of the LFG framework. Still,
these quantified constituency splits clearly provide ground ripe for further research
along the lines of the arguments adopted in this study. The current chapter does not claim to
examine all so-called floated positions, for example, the position after a verb phrase
object complement, as in (26), or the sentence-final position, as in (27):

(26) ʃa:had-tu ’at-tulla:b-a kull-a-hum

saw-I the-students-3PL.MASC.ACC all-3PL.MASC.ACC 

“I saw the students all.” = “I saw all the students.” 

(27) ’at-tulla:b-u katab-u: ’ad-dars-a kull-u-hum

the-students-3PL.MASC.NOM wrote the-lesson all-3PL.MASC.NOM 

“The students wrote the lesson all (of them).” 

Further work is needed to illustrate and analyze these constructions from both 
syntactic and semantic perspectives. Finally, other topicalization and/or focus con- 
structions in Arabic need to be further studied in order to reveal all their properties 
that can shed more light on the marked and unmarked FQ constructions examined in 
the current study. 

WITH THE INCEPTION of the digital age and, in particular, the widespread adoption of the
Internet as a communication tool and as a medium for information exchange, the 
amount of information available to the public has grown exponentially, although the 
tools for processing and extracting meaning from this enormous body of information 
have only grown linearly. To address these pressing needs, computational linguists 
have developed three main approaches to natural language processing (NLP): the 
statistical approach; the symbolic approach; and the hybrid approach, which com- 
bines features of both the statistical and symbolic approaches. 

In this chapter I present the history and progress of NLP beginning with the 
introduction of the digital computer in the late 1940s, to the rise of the Internet, which 
resulted in a massive explosion of information and the dominance of the digital 
format of communication. The abundance of information stored in electronic format 
required computational tools for information processing, retrieval, and extraction. 
Because Arabic is a major world language, both Arabic speakers and international 
entities with interest in the Arab world have the desire to develop tools for the 
analysis and processing of Arabic data. 

Here I first present a brief description of the properties of the Arabic language 
that are crucial to Arabic Natural Language Processing (ANLP). Then I focus on the 
development of symbolic and statistical paradigms for the processing of natural lan- 
guage. I discuss these paradigms in the context of theoretical and practical consider- 
ations for developing Arabic machine translation systems. My conclusion is that 
whereas statistical approaches to ANLP seem to be more successful from a native 
Arabic perspective, NLP approaches that promote rigorous analysis of the Arabic 
language could better meet the need for Arabic information processing and also 
satisfy other important sociocultural needs. 

NLP in the Digital Age 

From its earliest development in the 1940s, the computer was hailed as an innovation 
that would facilitate and promote the development and dissemination of knowledge. 
However, the widespread adoption anticipated by the nascent information technology
industry was constrained by the technology itself, because the expensive mainframe
computers available at the time were affordable only to governments, academic centers,
and the largest corporations.

But the advent of the personal computer in the 1980s created a paradigm shift
in the information technology industry because the tools for both creating and
processing information in digital formats were now available to small and midsize
entities and even to individuals. Computers were becoming smaller, cheaper, and more
powerful in both processing power and data storage capacity (Gazdar and Pullum
1985). However, the utility of the newly affordable computer was not obvious to the
nonspecialist. Only in the 1990s did the development and rapid adoption of the In- 
ternet worldwide create a second paradigm shift that enabled individuals to create 
and distribute information to and from the most remote corners of the world, usher- 
ing in what is now called the Digital Age. The enormous quantity of documents 
created and stored in digital format every minute has grown from kilobytes to mega- 
bytes to gigabytes and is now estimated to be several terabytes. 

However, the glut of information enabled by the Internet has simultaneously created
a pressing need for NLP tools to process, classify, and extract meaning from
the huge volume of unstructured data on the Internet. In a study undertaken by the University
of California, Berkeley (Lyman and Varian 2003), it was estimated that as recently
as 2002, 92 percent of the world’s information was available and/or stored on mag- 
netic tapes. But it is now estimated that millions of documents are created daily in 
sizes ranging from kilobytes to terabytes. E-mails alone account for 400,000 terabytes 
per year, and new social networking applications such as instant messaging create 5 
terabytes daily. It is also estimated that 40 percent of the world’s newly stored infor- 
mation is created in the United States. Thus, much of this new information is created 
in English, but multilingual applications are badly needed to make new information 
accessible to speakers of other languages. 

Although information is considered a path to prosperity and a means to obtain 
power, the glut of information could lead instead to a poverty of knowledge when 
governments, academia, and industry simply lack the means to process this informa- 
tion efficiently and in a timely manner. The challenge is that information is encoded 
in natural language, yet the necessary human expertise is neither sufficient nor avail- 
able to process terabytes of information. The world has become a global village that 
requires localization and applications that can transcend linguistic and cultural 
boundaries. What, then, is the solution? 

Computational linguistics offers a feasible solution with applications that span 
information retrieval, information extraction, question-answering systems, speech 
recognition, text summarization, and sentiment analysis. Clearly, machine translation 
is required to process this vast amount of information in multilingual documents and 
hence “democratize” knowledge in the global age. In the following section I
summarize the challenges that the Arabic language poses in ANLP development.

The Arabic Language 

Arabic, which is the world’s sixth most widely spoken language and one of the handful
of languages in which new information is created, presents computational linguists
with specific challenges. First, the language itself poses challenges (Farghaly
and Shaalan 2009) because the diglossic situation of Arabic requires that any system
recognize and decide beforehand which variety of Arabic it is addressing. Is it Classical
Arabic, Modern Standard Arabic (MSA), or one of the various Arabic dialects
(Levantine, Egyptian, Gulf, Iraqi, Maghrebi, etc.)? And it is even more complex than
that. Recent research in Arabic sociolinguistics (Bassiouney 2009; Eid 2007) shows 
that contemporary Arabic is characterized by mixing levels. Arabic speakers do not 
usually adhere to one variety of Arabic when they speak. They tend to use forms that 
belong to more than one variety at the same time, which adds complexity to the task 
of formulating algorithms for information extraction, retrieval, speech processing, 
and the like. 

Among all varieties of Arabic, Classical Arabic is the most prestigious, and it 
has been used in formal situations for the last fourteen centuries. However, despite 
its stability over 1,500 years, Classical Arabic is neither the native nor spoken lan- 
guage of any group; nor is it the language of contemporary writing. 

MSA, which evolved from Classical Arabic in the nineteenth century through 
contact and influence from the West, is the lingua franca of the Arabic-speaking 
world today, and is used primarily in business, the media, education, diplomacy, and 
so on. However, it has not been fully described. Further, it is not the native language 
of anyone and is learned in school like any foreign language. 

The dialects of Arabic, however, are acquired naturally and are the languages 
spoken daily among family, friends, and the community in general. Although they 
are well known by their speakers, they have not been entirely formally described by 
linguists. Further complicating the situation is the language mixing that commonly 
occurs in speech. 

The challenges for computational linguists then are these: Which language 
should be the focus of NLP? And how do they fully analyze MSA and the dialects? 

The Arabic Script 

The Arabic script presents a second challenge for computational linguists and NLP.
Unlike English, Arabic has an accurate representation of phonemes in its script;
that is, a one-to-one correspondence of sound to letter. However, certain features
of the script create ambiguities that the English script does not. The absence of explicit case
markings in most MSA texts creates multiple ambiguities, which make it difficult
to distinguish between subject and object and to determine the relationship of the
resumptive pronoun to its antecedent. The absence of both capitalization and strict rules
of punctuation complicates the tasks of information extraction and retrieval and of
establishing phrase boundaries. But the lack of internal voweling in most MSA texts is
perhaps the greatest source of ambiguity. As a result, three types of information are
lost due to the
Arabic script:

1. Case assignment: Arabic, with its relatively free word order, uses three case
markers to define the grammatical function of a word. But with the deletion of
case markers in written and frequently in spoken MSA, it is difficult to determine
the grammatical function of nominal expressions.

2. Homograph information: The absence of internal voweling makes it difficult
to determine the part of speech of a word without contextual clues. For example,
the word من could be the equivalent of the preposition “from”; the
wh-word “who”; or a verb meaning “to grant, bestow upon.”

3. Word sense: Even words that are not homographs could present a challenge
because it is difficult to distinguish between the different senses without internal
voweling. A case in point is the word رجل, which could mean either
“leg” or “man.”
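
The homograph problem in item 2 can be illustrated with a few lines of Python. This is a sketch under my own assumptions: the three vocalized spellings (min “from,” man “who,” manna “grant”) and the function names are supplied for illustration, and the code simply shows that removing the short-vowel marks collapses the three forms into one written string.

```python
import re

# Arabic marks from fathatan (U+064B) through sukun (U+0652),
# a range that covers the short vowels and the shadda.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove short-vowel and related marks, as in most printed MSA."""
    return DIACRITICS.sub("", text)


# Fully vocalized forms (illustrative assumption) and their glosses.
VOCALIZED = {
    "\u0645\u0650\u0646\u0652": "preposition 'from'",                # min
    "\u0645\u064E\u0646\u0652": "wh-word 'who'",                     # man
    "\u0645\u064E\u0646\u0651\u064E": "verb 'grant, bestow upon'",   # manna
}


def analyses(unvocalized):
    """Return every gloss whose vocalized form matches the bare spelling."""
    return sorted(gloss for form, gloss in VOCALIZED.items()
                  if strip_diacritics(form) == unvocalized)


for gloss in analyses("\u0645\u0646"):
    print(gloss)
```

A real system would need contextual clues to choose among the returned analyses; the sketch only makes the loss of information visible.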

Ambiguity 

In addition to problems posed by the script, other features of the Arabic language 
contribute to ambiguity, as shown in the following subsections. 

Word Segmentation 

Arabic words can be segmented in different ways, which makes it difficult to determine
exactly how a word should be segmented (Sorensen and Zitouni 2010). As an
example, وهمي could be segmented in three ways, yielding significantly different
meanings:

(1a) و+هم+ي: conj + noun + possessive pronoun (and my worry)

(b) وهمي “wahmy”: noun + adjectival marker (imaginary; illusory, false)

(c) وهم+ي “wahm+y”: noun + possessive pronoun (my illusion; my imagination)
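
The three analyses above can be reproduced by a small enumeration over a toy lexicon. The following Python sketch rests on my own assumptions: the Latin transliteration (w, hm, whm, y), the tiny lexicon, and the treatment of the adjective as a single lexicalized stem are simplifications for illustration only.

```python
# Toy lexicon in a simplified Latin transliteration (illustrative assumption).
PREFIXES = {"w": "conj 'and'"}
STEMS = {
    "hm": "noun 'worry'",
    "whm": "noun 'illusion'",
    "whmy": "adjective 'imaginary'",  # simplification: stored as one stem
}
SUFFIXES = {"y": "possessive pronoun 'my'"}


def segmentations(word):
    """Return every (prefix, stem, suffix) split licensed by the toy lexicon."""
    results = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix and prefix not in PREFIXES:
                continue
            if stem not in STEMS:
                continue
            if suffix and suffix not in SUFFIXES:
                continue
            # Keep only the non-empty morphemes of this split.
            results.append(tuple(m for m in (prefix, stem, suffix) if m))
    return results


for analysis in segmentations("whmy"):
    print("+".join(analysis))
```

The enumeration only lists the candidates; choosing among them is the task that the rule-based and machine learning approaches discussed next have to solve.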

The problem of Arabic word segmentation (Benajiba and Zitouni 2009) consti- 
tutes one of the most formidable challenges in Arabic NLP. In rule-based applica- 
tions, linguistic knowledge is used to decide how an Arabic word could be segmented 
and to identify the correct part of speech for each morpheme. Such linguistic
knowledge makes use of contextual information as well as linguistic rules that define
which morphemes may concatenate with which stem. In machine learning
applications, a training data set with each word segmented correctly by human annotators
is needed. For example, in a pioneering work on Arabic word segmentation (Lee et 
al. 2003), a training set of about 110,000 Arabic words segmented by human experts 
was compiled and a trigram language model was created. A corpus of 155 million 
words was segmented using unsupervised learning. The results were repeatedly re- 
fined using a trigram language model to segment the 155 million words into stems, 
prefixes, and suffixes, and the model achieved 97 percent agreement with human 
annotators. Building on this work, Sorensen and Zitouni (2010) exploited the capa- 
bilities of finite state models (Beesley and Karttunen 2003) for Arabic word segmen- 
tation. They report in their work that whereas words that were seen previously in 
their training data and/or their stem and word dictionary were segmented correctly, 
unknown words were not segmented at all. Thus they added a unigram character- 
based model, which was more effective in segmenting words that were not seen in 
the training data. 
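
To make the role of a trigram language model in segmentation more concrete, the following Python sketch scores candidate segmentations with an add-one smoothed trigram model over morphemes. It is not the system of Lee et al. (2003) or of Sorensen and Zitouni (2010): the five-word training set, the transliteration, and the "w#" and "+y" morpheme notation are all hypothetical, and a model estimated from a realistic corpus would behave very differently.

```python
import math
from collections import Counter

# Hypothetical hand-segmented training words (prefixes end in '#', suffixes start with '+').
TRAIN = [
    ["w#", "hm", "+y"],
    ["w#", "ktAb", "+y"],
    ["whm", "+y"],
    ["ktAb", "+y"],
    ["w#", "hm"],
]


def trigrams(morphs):
    """Pad a morpheme sequence and return its trigrams."""
    padded = ["<s>", "<s>"] + list(morphs) + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


TRI = Counter(g for word in TRAIN for g in trigrams(word))
HIST = Counter(g[:2] for word in TRAIN for g in trigrams(word))
VOCAB = {m for word in TRAIN for m in word} | {"</s>"}


def log_prob(morphs):
    """Add-one smoothed trigram log-probability of a segmentation."""
    return sum(
        math.log((TRI[g] + 1) / (HIST[g[:2]] + len(VOCAB)))
        for g in trigrams(morphs)
    )


def best(candidates):
    """Pick the candidate segmentation with the highest model score."""
    return max(candidates, key=log_prob)


candidates = [["w#", "hm", "+y"], ["whm", "+y"]]
print(best(candidates))
```

With this toy data the two-morpheme analysis scores higher, partly because shorter sequences multiply fewer probabilities; this bias is one reason practical systems refine such models iteratively on large corpora.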

Prepositional Attachment

Like many other languages, Arabic allows prepositions to attach to nouns or to verbs,
which creates syntactic ambiguity. In the following example, the Arabic prepositional
phrase should attach to the noun phrase and the preposition لـ li should be translated
as “of”:

(2) [Arabic text not recoverable]

“I saw an ad of the Egyptian company”

In contrast, in the following sentence, the prepositional phrase is attached to the
verb phrase and the preposition لـ li is translated as “to” and not as “of,” as in the
preceding sentence:

(3) [Arabic text not recoverable]

“I submitted an application to the Egyptian company”

Thus, although the surface structure of the two sentences is identical, an Arabic
machine translation system needs to distinguish between the two and give the correct
translation of the preposition in each case. One possible solution is that the lexical
coding of the verb meaning “submit” would be associated with the preposition لـ li,
whereas the coding of the verb meaning “see” is not associated with the preposition.

There are also cases when the preposition could attach to either the verb or the 
noun phrase depending on the intended meaning. An example of such an ambiguous 
sentence is the following: 

(4) [Arabic text not recoverable]

“I decided on traveling in March”

The ambiguity here lies in the attachment of the preposition في fi, meaning “in.”
If it is attached to the verb, the decision on traveling was made in March. But if it is
attached to the noun phrase السفر al-safar “traveling,” then traveling will be in March
whereas the decision to travel could have been made at any time.

Constituent Boundaries 

In the following example, the ambiguity lies in the boundary of the adjective phrase
within the noun construct. That is, does “new” modify “manager,” or does it modify
“bank”? The two choices lead to two different interpretations:

(5) [Arabic text not recoverable]

“I met the new manager of the bank” or “I met the manager of the
new bank”



Semantic Ambiguity 

As in any language, sentences and phrases may be interpreted in different ways.
For example:

(6) [Arabic text not recoverable]

Does this mean “Ali likes Ahmed more than he likes Samira”? Or does it mean
“Ahmed likes Ali more than he likes Samira”? Or does it mean “Ali likes Ahmed 
more than Samira likes Ahmed”? 

Pronominal Ambiguity 

Because Arabic is an agglutinative language, it permits pronoun attachment to verbs, 
prepositions, and nouns, depending on the syntactic relationship. If attached to a verb, 
it is the object of the verb; if attached to a preposition, it is the object of a preposition; 
and if attached to a noun, it indicates possession. But Arabic also uses the resumptive 
pronoun, which must agree in gender and number to refer back to a noun if it has 
been previously mentioned. Therefore, in a sentence like the following, it is difficult
to determine whether the attached pronoun ـه (-hu) is referring to the subject
(“the minister”) or the direct object (“the journalist”):

(7) [Arabic text not recoverable]

As a result, sentence (7) could be translated as either “The minister_i met with
the journalist who criticized him_i” or as “The minister_i met with the journalist whom
he_i criticized.”

Pro-Drop Ambiguity 

In Arabic, subject pronouns may be dropped (Eid 1980; Farghaly 1982) subject to 
the Recoverability of Deletion Condition (Chomsky 1965). This property of option- 
ally dropping subject pronouns and allowing subjectless sentences is not limited to 
Arabic but is also found in other languages such as Italian, Spanish, and Korean, to 
name but a few. The pro-drop property is challenging in Arabic NLP because
of the verb-subject-object (VSO) word order, the absence of short vowels, and the
relatively free word order. For example, in most languages that allow subject
pronouns to drop, a machine translation system can translate passive and active verbs
correctly because there is an explicit difference between the two that usually shows
in the form of inflection. This is also true of a small set of Arabic verbs called
“hollow verbs.” For example, an active hollow verb in Arabic is clearly distinguished
from its corresponding passive form, such as قال, “he said,” and قيل, “it is said.” For
most other Arabic verbs the task of distinguishing between an active and a passive
verb is complicated because of the lack of internal voweling. Consider the following
sentence, where the pro-drop feature contributes to the ambiguity:

(8) أكلت البطة

A machine translation system needs to select the proper translation from at 
least the following readings: 

(9a) I/she/you ate the duck. 

(b) The duck was eaten. 

(c) The duck has eaten. 

The reading in (9a) assumes that the verb is an active verb, and that the sentence 
has a subject pronoun that is dropped, a verb, and a direct object. However, because 
of the lack of internal voweling, the subject pronoun could be first-person singular, 
third-person feminine singular, or second-person singular. The reading in (9b) is 
based on the analysis that it is a passive sentence and that the grammatical subject is 
not dropped. The third translation also assumes that the lexical noun phrase here is 
the subject and therefore there is no pro-drop. 

The ambiguity of the verb as an active or passive verb does not exist in spoken 
Arabic because in speech internal voweling disambiguates. Much research on Arabic 
NLP (Elshafei, Al-Muhtaseb, and Alghamdi 2006; Habash and Rambow 2007; 
Maamouri, Bies, and Kulick 2006; Zitouni, Sorenson, and Sarikaya 2006) has fo- 
cused on automatic diacritization of written texts, which thus reduces the level of 
ambiguity in written Arabic texts. 

Arabic Morphology 

Arabic, like other Semitic languages, is characterized by a complex and rich morphol- 
ogy. Traditional grammarians recognized the root as the basic underlying form of 
Arabic words and described Arabic morphology in terms of patterns (أوزان). The 
building blocks of Arabic surface stems are the consonantal root, which represents a 
semantic field like {KTB} “writing,” and the vocalism that represents a grammatical 
form. The process of combining the root and the vocalism to form surface words 
forces both the elements of the root and those of the vocalism to be discontinuous. 
Thus most Arabic surface words are characterized by a discontinuity at the morpho- 
logical level. McCarthy (1981) assigns different tiers to the vocalism and the conso- 
nantal root and describes the way they combine to form surface words. Hence, the 
high inflectional property of Arabic is a result of the fact that a relatively small 
number of consonantal roots (6,000 roots), when combined with a vocalism convey- 
ing grammatical form, yield hundreds of thousands of words. 
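
The way a root and a vocalism interleave can be illustrated with a minimal sketch. It works on Latin transliteration, and the template notation (`C` marks a slot for a root consonant) and the function name `interdigitate` are assumptions made for illustration; they are not McCarthy's formalism.

```python
def interdigitate(root: str, template: str) -> str:
    """Fill each 'C' slot in the template with the next root consonant.

    Every other template character (the vocalism) is copied unchanged.
    """
    consonants = iter(root)
    out = []
    for ch in template:
        if ch == "C":
            c = next(consonants, None)
            if c is None:
                raise ValueError("template has more slots than the root has consonants")
            out.append(c)
        else:
            out.append(ch)
    if next(consonants, None) is not None:
        raise ValueError("template has fewer slots than the root has consonants")
    return "".join(out)


# One root, several vocalisms: the root consonants end up discontinuous.
for template in ("CaCaCa", "CuCiCa", "CaaCiC"):
    print(interdigitate("ktb", template))
```

The same three consonants surface in every form, which is why a small inventory of roots can yield a very large vocabulary.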

Therefore, with respect to the internal morphology of Arabic, consonants and 
vowels have specialized roles, unlike languages such as English and French. Conso- 
nants represent a field of meaning, whereas short vowels represented with diacritical 
marks carry grammatical meaning such as tense, voice, and case. For example, the 
pattern a-a-a (كَتَبَ, “he wrote”) is the past tense / active voice, whereas the same root 
with the pattern u-i-a (كُتِبَ, “it was written”) would be the past tense / passive voice. 
Therefore the loss of diacritical marks in written MSA text has profound implications 
for computational linguistics because the meaning of a phrase or word becomes dif- 
ficult to translate accurately or becomes ambiguous when the grammatical function 
is unclear. 
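
The effect of dropping diacritics can be shown with a short sketch. It assumes that Unicode combining marks are an adequate stand-in for Arabic short-vowel diacritics; the function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop every combining mark, which includes the Arabic short-vowel signs."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


# kaf + fatha, ta + fatha, ba + fatha: kataba, "he wrote"
active = "\u0643\u064e\u062a\u064e\u0628\u064e"
# kaf + damma, ta + kasra, ba + fatha: kutiba, "it was written"
passive = "\u0643\u064f\u062a\u0650\u0628\u064e"

print(active != passive)                                       # distinct when vocalized
print(strip_diacritics(active) == strip_diacritics(passive))   # identical when unvocalized
```

Once the marks are removed, the active and passive forms are the same three-letter string, so voice has to be recovered from context.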

Vowels also play a role in dialectal differences. Although the consonants and 
consonantal order remain unchanged among dialects, short vowels do change within 
a word, while the meaning remains unchanged. For example, “huna” — the word for 
“here” in MSA — becomes “hina” in the Egyptian dialect and “hena” in the Gulf 
dialect. This too poses challenges for NLP and speech recognition applications. 

External morphology, or how affixes are attached to a stem or root, is also rule 
governed. In an agglutinative language like Arabic, affixes representing different 
parts of speech can be conjoined together with a root or stem to form a token that 
has a syntactic structure. For example, the token بالمدينة “in the city” is a prepositional 
phrase that has a stem which is a noun, and two prefixes: The first is ب “in,” 
a preposition, and the second is ال “the,” a definite article. Thus external morphology 
describes the way affixes that represent different parts of speech are attached to 
Arabic stems and the order of attachment is rule governed. 
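
A rule-governed order of attachment can be sketched as a tiny segmenter. The prefix table, the stem lexicon, and the function name `segment` are hypothetical and cover only the example token; a real analyzer needs a full lexicon and must handle stems that happen to begin with a prefix letter.

```python
# Hypothetical toy resources for a single example.
PREFIXES = {"ب": "PREP(in)", "ال": "DET(the)"}
PREFIX_ORDER = ["ب", "ال"]  # the preposition attaches outside the article
STEMS = {"مدينة": "NOUN(city)"}


def segment(token: str):
    """Peel off prefixes in the licensed order, then look up the stem."""
    parts = []
    rest = token
    for prefix in PREFIX_ORDER:
        if rest.startswith(prefix) and len(rest) > len(prefix):
            parts.append((prefix, PREFIXES[prefix]))
            rest = rest[len(prefix):]
    if rest not in STEMS:
        return None  # no analysis in this toy lexicon
    parts.append((rest, STEMS[rest]))
    return parts


for form, tag in segment("بالمدينة"):
    print(form, tag)
```

Because the prefixes are tried in a fixed order, a token with the article outside the preposition receives no analysis, which mirrors the claim that attachment order is rule governed.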

Arabic as a Nonconfigurational Language 

One of the challenges in Arabic computational linguistics is that Arabic, unlike 
English, is not a configurational language. In a configurational language, the subject 
and direct object occur at two different levels in the syntactic structure of a sentence: 
The subject is dominated by the S node, whereas the direct object is usually dominated 
by the verb phrase. At the surface level they are separated by the verb, so their 
boundaries are well defined. This is not the case in VSO sentences in Arabic. The 
subject and object are at the same level, dominated by the verb phrase, and nothing 
separates them. Hence, it is difficult for a parser to determine where the subject ends 
and the object begins in sentences such as the following: 

(10a) OlSjSfl ( _ r -5) -Cis 

criticized-he Obama chief of staff 
“Obama criticized the chief of staff.” 

(b) oisjSli Jux- 

resigned-he McChrystal chief of staff 
“McChrystal, the chief of staff, resigned.” 

The correct analysis of (10a) is that the verb is followed by two separate con- 
stituents. The first noun phrase is “Obama,” the subject; and the second is the “chief 
of staff,” which is the direct object. In sentence (10b) the verb is instead followed by 
only one constituent, which is the subject. 

Symbolic NLP 

Symbolic NLP, also referred to as rule-based NLP, is an approach to developing NLP 
applications that relies primarily on the linguistic description of a language. The 
basic assumption is that it mimics the knowledge that native speakers have when 
they translate a text or understand speech and written language. Thus, symbolic NLP 
has often been used to test the validity and accuracy of grammars that linguists de- 
velop. The first NLP programs were developed using symbolic NLP. Only recently 
has another approach, “the statistical paradigm,” gained acceptance. This paradigm 
makes use of the availability of corpora and has devised machine learning algorithms 
meant to “learn” the linguistic rules from the data rather than relying on linguists to 
do the job. In the next section I discuss the early machine translation systems that 
were developed following a symbolic approach. 

Early Machine Translation Systems 

The invention of the digital computer in the 1940s inspired scientists to think of us- 
ing the unprecedented speed of the computer to translate texts from one language to 
another. Having been so inspired, scientists started to take practical steps to realize 
the dream and vision of Descartes, who wrote in 1629 about a mechanical process 
to convert one human language to another. In 1949, Warren Weaver, the pioneer of 
machine translation, wrote a memorandum to his colleagues making four proposals 
for machine translation systems that go beyond word-for-word translation. 

Weaver realized that many words in a language were ambiguous, and he pro- 
posed in his memorandum to solve this problem by examining the immediate context 
of the ambiguous word (Hutchins 2000). He also drew attention to the analogy be- 
tween the structure of the human brain and the “logical machine.” He concluded that 
the machine translation problem was solvable. He also suggested using the crypto- 
graphic methods that linguists used in World War II for deciphering German secret 
code. These methods relied heavily on frequencies of letters, combinations of letters, 
and letter patterns. He also believed that underlying the statistical regularities of 
languages there is a logical and universal foundation, which could represent an 
alternative to translating from one language to another. This idea was implemented in 
the 1980s in machine translation systems utilizing an interlingua approach. 

With the beginning of the Cold War in the 1940s, there was an urgent need for 
machine translation because the United States decided it was essential to scan and 
interpret every Russian communication coming out of the Soviet Union. However, 
there were not enough translators to keep up with the huge volume of Russian books 
and papers published in the Soviet Bloc at that time. The urgent need to translate 
Russian into English coincided with the invention of computers. It was not surprising, 
then, that developing Russian-to-English machine translation systems would be one 
of the first tasks these “miracle” machines were set to perform. 

The first demonstration of the feasibility of fully automated machine translation 
took place in New York on January 7, 1954. On that day, Georgetown University and 
IBM showcased the first nonnumerical applications and capabilities of the “new” 
electronic brain by presenting a fully automated Russian-English machine translation 
system. The system embraced the commonly held view that a language consisted 
of a lexicon and a finite set of rules that could generate an infinite set of 
sentences. Surprisingly, the first Russian-to-English machine translation system had 
only 250 words and 6 syntactic rules. This experiment raised high expectations that 
probably within five years machine translation systems would be readily available. 
The promise was to develop a system that did not require preediting of the input 
while producing a reliable translation of the input text in the target language that was 
clear and intelligible, and required only stylistic modifications. At the time no details 
were given about the workings in the system. For example, no information about 
dictionary content and lookup procedures was given, and there was no account of 
how the syntactic analysis of the Russian sentences was performed and how the 
target English structure was selected. However, there were some references to 
reversing the order of pairs of sentences by assigning rules to the lexical items involved. 

Later, Garvin (1967) presented a fuller account of the system, including a more 
detailed description of the dictionary. For example, the dictionary 
entries were sometimes stems, endings, or full words. Each entry was associated with 
three codes: the first code indicated which of the six syntactic rules applied, the 
second code determined which contextual information was needed to determine the 
target translation, and the third code indicated whether words were to be inverted. 
At the technical level the system represented the first attempt at nonnumerical 
programming, which presented developers with many challenges. Developers had to deal 
with character coding of Russian and how dictionary entries were to be stored, what 
lookup procedure would be followed, and how the syntactic rules would be coded 
and executed. 

The Georgetown-IBM experiment and other similar work at the time were 
significant for four main reasons. First, it was demonstrated that the digital computer 
could perform nonmathematical tasks such as machine translation. The system took 
advantage of the speed of the computer relative to human translators. Second, it was 
shown that the computer surpassed humans in that it would never forget, could work 
continuously without getting tired, and would never ask for a raise or a vacation! 
Third, the system demonstrated the need to specify and describe linguistic structures 
at different levels, such as the lexical and syntactic levels. And fourth, ambiguity of 
language was understood to be a problem, although it was underestimated. 

As Hutchins (1995) states, however, the period from 1956 to 1966 was “a decade 
of high expectations and disillusions.” The promise to deliver a fully automated 
machine translation system, with no preediting and only stylistic postediting with 95 
percent accuracy, was never achieved. Serious research showed that language struc- 
ture was much more complex than previously thought and that translators use huge 
amounts of linguistic, domain-specific, real-world, and commonsense knowledge 
that was not considered relevant at the time. The ALPAC (1966) report concluded 
that machine translation was not viable given the state of knowledge at the time. 
Consequently, funding for research on machine translation was halted in the United 
States and did not resume until the middle and late 1970s. 

The First Arabic Machine Translation System 

The first English-to-Arabic machine translation system was developed in the late 
1970s by Weidner Communications Inc. in Provo, Utah, and was released in 1982. 
This system was developed following the direct method, and it was very ambitious 
in its objective, which was “to produce fully automated Arabic translations of unlim- 
ited English source documents in unrestricted domains.” There was no preediting 
module, although it included a module for postediting if desired. The postediting 
module presented the source text and its translation in two different windows 
on the same screen. This allowed the posteditor to view the English and Arabic 
texts side by side. Special buttons were provided as a shortcut for frequent postedit- 
ing operations. For example, there was a button to swap two words, which was a 
common task because the syntactic component was not very deep. Although the 
Weidner English-to-Arabic system claimed that postediting was not necessary, it was 
needed more often than not. It is also worth noting that although the system did not 
need a preediting module, a lot of work needed to be done on the dictionary to ensure 
that the translation was of a reasonable quality. For example, entering the following 
English sentence as an input to the system “Interest rate rose by 1 percent” produced 
the following absurd Arabic translation: 

(11) 1 ibjj percent 

“rate of interest flower 1 percent” 

This does not give even the gist of the English sentence. However, when the 
dictionary manager adds the entry “interest rate” as an idiom with the Arabic 
equivalent سعر الفائدة, and assigns priority to “rose” as a verb over “rose” as a noun, the 
machine translation system would yield the following Arabic translation for the same 
English sentence: 

(12) 1 j! percent 

Though not perfect, this is a reasonable translation. Much dictionary work 
needed to be done to prepare for the translation of each document. 
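
The two dictionary adjustments described above can be sketched as follows. This is not a description of the Weidner system's internals: the data layout, the function name `translate`, and the capitalized English glosses that stand in for Arabic target words are all assumptions made for illustration, and word reordering is left out.

```python
def translate(tokens, words, idioms, pos_priority):
    """Word-for-word lookup with a two-word idiom table consulted first."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in idioms:
            out.append(idioms[pair])
            i += 2
            continue
        senses = words.get(tokens[i], {})
        # Pick the sense whose part of speech ranks first in pos_priority.
        for pos in pos_priority:
            if pos in senses:
                out.append(senses[pos])
                break
        else:
            out.append(tokens[i])  # unknown word is passed through unchanged
        i += 1
    return out


WORDS = {
    "interest": {"noun": "INTEREST"},
    "rate": {"noun": "RATE"},
    "rose": {"noun": "FLOWER", "verb": "INCREASED"},
}
tokens = "interest rate rose by 1 percent".split()

before = translate(tokens, WORDS, idioms={}, pos_priority=["noun", "verb"])
after = translate(tokens, WORDS,
                  idioms={("interest", "rate"): "INTEREST-RATE"},
                  pos_priority=["verb", "noun"])
print(before)
print(after)
```

A single global priority list is a crude device; it is used here only to show why such tuning had to be repeated for each document.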

Like all other machine translation systems that adopted the direct method of 
word-to-word translation, the Weidner system was designed for a specific pair of languages: English 
as the source language and MSA as the target language. The system consisted of two 
main stages: analysis of the source language and generation of the target language. 
The analysis of English was oriented to enable the correct generation of target lan- 
guage expressions employing a large bilingual dictionary as well as a dictionary for 
idiomatic expressions. The syntax of English was analyzed only to 
the extent required to generate Arabic equivalents. It was meant to generate the right 
Arabic word order and correct agreement. However, it did not adhere to this consis- 
tently. Thus the system was unidirectional and did not perform deep syntactic or 
semantic analysis of the source language. 

The system was commercially utilized by Omnitrans of California Inc., which 
used it for the purpose of translating the Micropedia edition of Encyclopedia Britan- 
nica into Arabic. With the sharp rise of oil prices in the late 1970s, oil-producing 
Arab countries were suddenly blessed with enormous wealth. Many Arab countries 
wanted to invest the newly accumulated resources in an ambitious process of mod- 
ernization, and education was the focus of their interest, so many schools, universi- 
ties, and research centers were built. Omnitrans of California noted this and also 
noticed that in spite of the expanding population of the Arab world and the investment 
in education, the Arabic people did not have a single source of knowledge such as 
Encyclopedia Britannica in their native language. Omnitrans thought it was an excel- 
lent business opportunity and would be a contribution to the Arabic language to offer 
an Arabic version of the Encyclopedia Britannica. Thus, it acquired a contract from 
Encyclopedia Britannica to allow it to publish an Arabic version of the Micropedia 
edition of Britannica, which in 1983 came in twenty volumes and contained 20 mil- 
lion words. Omnitrans acquired the Weidner English-to-Arabic machine translation 
system to help expedite the translation of the twenty volumes of the Micropedia. 
However, this project was not completed due to a lack of funding. 

Another user of the Weidner English-to-Arabic machine translation system was 
the Sultanate of Oman, which used it to translate official English documents into 
Arabic. In this capacity, the author can attest that it was possible to get reasonable 
output from the system by manipulating dictionary entries targeting specific domains. 
Not surprisingly, the quality of the translation of scientific texts was much better than 
that of general English texts. This phenomenon was common to all machine 
translation systems at the time, which gave rise to the notion of restricting some systems 
to narrowly defined domains. In a well-defined domain, ambiguity is reduced by 
eliminating readings that do not belong to the domain of interest. This realization 
gave rise to systems that are built to translate a sublanguage (Kittredge and Lehr- 
berger 1982), which is usually defined as “a subvariety of language used in a par- 
ticular field or by a particular social group and characterized especially by distinctive 
vocabulary and syntax.” However, further development of the Weidner English-to- 
Arabic system stopped shortly after the company was acquired in 1984. 

Problems with the Direct Approach 
to Machine Translation 

The direct machine translation approach did not rely on deep linguistic analysis of 
the source language. It involved superficial manipulation of the word order of the 
source sentence to make it look more similar to the order of the target language. Ac- 
cordingly, machine translation developers and researchers soon realized that the di- 
rect method could not deal with the complexity of natural language. For example, 
what was thought to be a simple swapping operation of switching the subject and 
verb from a subject-verb-object (SVO) to a VSO structure turned out to be very 
complex. The translation of an English sentence like “John loves Mary” into Arabic 
involves moving the subject, John, to follow the verb in Arabic, yielding 
يحب جون ماري. However, when the subject of the English sentence becomes 
complex, as in “The tall man, who was wearing a red tie and a white shirt and 
was speaking in an Italian accent with his guests, greeted us warmly,” identifying the 
length and boundary of the subject requires deep parsing. The direct approach would 
not yield accurate results for complex sentences like this because it did not incorpo- 
rate the required syntactic knowledge. There was also a need to develop the technol- 
ogy to efficiently perform deep parsing and to represent complex disambiguation 
rules. The transfer approach to machine translation, in which levels of syntactic 
representation are computed and themselves transferred, provided significant contri- 
butions on two fronts: the syntactic description of language, and a new technology 
for the representation and processing of deep syntactic parsing. In the next section I 
describe the progress made in the 1970s and 1980s in linguistic theory. 
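
The difficulty with the swapping operation discussed above can be made concrete with a small sketch. Both function names are hypothetical, and the second function simply assumes that some parser has already supplied the length of the subject.

```python
def naive_svo_to_vso(tokens):
    """Swap the first two tokens, assuming a one-word subject followed by the verb."""
    if len(tokens) < 2:
        return list(tokens)
    return [tokens[1], tokens[0]] + list(tokens[2:])


def svo_to_vso(tokens, subject_len):
    """Move the verb in front of a subject whose length is already known."""
    subject = list(tokens[:subject_len])
    verb = tokens[subject_len]
    rest = list(tokens[subject_len + 1:])
    return [verb] + subject + rest


print(naive_svo_to_vso("John loves Mary".split()))
# A multiword subject breaks the one-word assumption:
print(naive_svo_to_vso("The tall man greeted us".split()))
# With the subject boundary supplied, the reordering is trivial:
print(svo_to_vso("The tall man greeted us".split(), subject_len=3))
```

The reordering itself is easy; the hard part is finding `subject_len`, which is exactly what requires deep parsing.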

Progress in Linguistic Theory 

The second half of the twentieth century witnessed a paradigm shift in linguistics 
when Noam Chomsky (1957, 1965) challenged the well-established theory of struc- 
tural linguistics (Bloomfield 1933). Chomsky redefined the goals of linguistic theory 
to account for native speakers’ intuitions about their language rather than simply 
investigating a corpus and finding regularities in that corpus. He also challenged the 
view held by structuralists that a child is born with a “tabula rasa,” that is, with no 
knowledge of language at all. Structural linguists believed that it is through listening, 
imitating, and repetition that a child acquires the language of his or her people. 
Chomsky showed that comparing the linguistic knowledge that a child internalizes 
with the fragments to which he or she is exposed in early linguistic experience points 
to a gap that needs to be accounted for. The explanation that Chomsky offers is that 
a child is born with innate knowledge of “language”; that is, though born 
without knowledge of any specific language, the child nevertheless knows what language 
is. In Chomsky’s terms the child is born with Universal Grammar. For Chomsky, this is the 
only explanation for the uniformity and remarkable speed of language acquisition. 

Chomsky also challenged the structuralists’ position that in order to write a 
description of a language, they must obtain a corpus of the language and perform a 
“discovery procedure” to deduce the generalizations underlying the language. He 
argues that a corpus of native speakers’ utterances represents only the performance 
of the speakers of the language. Performance is usually affected by lapses of mind, 
change of plans, fatigue, distractions, and so on, so it is not always a true reflection 
of native speakers’ knowledge of their language. Speakers very often recognize the 
ungrammaticality of what they actually said and they have no problems correcting 
their errors. Thus, depending on the corpus alone could produce an incorrect 
grammar. Further, Chomsky argues that native speakers can easily understand and/or say 
sentences they have never heard or said before. A grammar needs to reflect this 
creative property of human language, which differentiates it from other systems of 
communication. Thus, Chomsky argues that a linguist should aim at describing the 
speaker’s mental grammar by eliciting his intuitions. He makes a fundamental dis- 
tinction between competence and performance. For Chomsky, competence is the 
linguistic knowledge that a speaker has of his language, whereas performance is what 
he actually says — not always a true reflection of his linguistic knowledge. Chomsky 
(1965) states clearly that linguistic theory must be concerned with characterizing 
native speakers’ competence rather than performance. 

Transformational generative grammar (Chomsky 1965) had interesting implica- 
tions for computational linguistics and machine translation. First, because of the 
creativity of language, the grammar of a language must make a distinction between 
an infinite set of sentences representing what has been said and could be said in that 
language and ungrammatical sentences. Because languages are learnable, their 
grammar must be finite. Thus, a grammar consists of a finite set of rules that generates an 
infinite set of sentences. Hence recursion has become an important property of phrase 
structure grammar. Second, the grammar that is elucidated by the linguist should 
mimic native speakers’ intuitions, because Chomsky points out that those native 
speakers can recognize the different interpretations of an ambiguous sentence. For 
instance, native speakers of Arabic can assign at least two different interpretations 
of the following sentence: 

(13) رأيت مدير البنك الجديد 

Speakers of Arabic would recognize that (13) can be translated as either “I saw 
the new manager of the bank” or “I saw the manager of the new bank.” An adequate 
grammar of Arabic must mimic native speakers’ ability to recognize the ambiguity 
of such a sentence by assigning two different structural descriptions to the sentence. 
Because ambiguity is one of the most challenging aspects of NLP, computational 
linguists should understand the relevance of generative grammar to their work. For 
example, an Arabic-to-English machine translation system must assign different 
structures to the seemingly identical sentences in (14) and (15): 

(14) b 1x5” cij I 

“I read a book by Chomsky.” 

(15) . abxL b 1x5" ‘ c- 1 

“I gave a book to Chomsky.” 

The problem here is known as the prepositional attachment problem, which is 
to determine whether a preposition should attach to the verb or to the noun phrase. 
Moreover, in Arabic the correct translation of the preposition in cases like this depends on 
the relationship of the preposition to the constituent it modifies. 

Progress in NLP Technology 

The progress in linguistic theory in the 1970s and 1980s, which started with Chom- 
sky’s argument for the lexicalist hypothesis over the transformational analysis of 
nominalization (Chomsky 1970), coincided with even more important progress in 
NLP technology. In the 1970s chart parsing (Kay 1973) was developed, which was 
particularly suited to parsing ambiguous context-free grammars such as grammars of 
natural languages. Chart parsing uses a dynamic programming approach and can be 
implemented in either a top-down or bottom-up parsing approach or a combination 
of the two. Chart parsing eliminates the need for backtracking and avoids parsing 
any input that has been previously successfully parsed. Before any new input is 
parsed, the parser looks into the chart to check if it has been parsed, so chart parsing 
reduces processing time. It also provides a compact way for representing local am- 
biguities. Several versions of chart parsing have been developed over the years, such 
as the Earley parser and the Cocke-Younger-Kasami algorithm. 
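
How a chart stores partial results and packs ambiguity can be seen in a minimal Cocke-Younger-Kasami sketch that counts analyses. The English toy grammar in Chomsky normal form, the table names, and the function name `count_parses` are assumptions chosen to reproduce a prepositional attachment ambiguity; they do not come from any system described here.

```python
from collections import defaultdict

# Binary rules, indexed by right-hand side: (B, C) -> [A] for each rule A -> B C.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the phrase attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # the phrase attaches to the noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(words, start="S"):
    """Return the number of distinct parse trees for the word list."""
    n = len(words)
    # chart[i][j][X] = number of ways category X derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for cat in LEXICAL.get(word, []):
            chart[i][i + 1][cat] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, lcount in chart[i][k].items():
                    for right, rcount in chart[k][j].items():
                        for parent in BINARY.get((left, right), []):
                            chart[i][j][parent] += lcount * rcount
    return chart[0][n].get(start, 0)


print(count_parses("I saw the man with the telescope".split()))
```

Each cell is filled once and reused by every larger span, so no substring is parsed twice and no backtracking is needed.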

Another formalism that had a great impact on the progress in computational 
linguistics was Definite Clause Grammar (DCG) (Pereira and 
Warren 1980), which represents grammars as definite clauses in first-order logic. 
Rules written in Definite Clause Grammar are similar to the phrase structure rules 
that linguists are used to. Here is an example of a fragment of a DCG grammar 
for Arabic: 

(16) sentence --> verb_phrase, noun_phrase, noun_phrase. 
verb_phrase --> verb. 
noun_phrase --> det, common_noun. 
det --> [ال]. 
common_noun --> [مدير]. 
common_noun --> [رجل]. 
verb --> [استقبل, masc, sing]. 
verb --> [ساعد, masc, sing]. 

The simple grammar given above generates and parses sentences such as the 
following: 

(17a) استقبل المدير الرجل “The manager received the man.” 

(b) استقبل الرجل المدير “The man received the manager.” 

(c) ساعد الرجل المدير “The man assisted the manager.” 

(d) ساعد المدير الرجل “The manager assisted the man.” 

It is possible to augment DCG grammars with attributes such as agreement 
markers. For example, we can impose the rule that the verb in a verbal Arabic sen- 
tence has to be singular regardless of whether the subject is singular, dual, or plural, 
as seen below: 

(18) sentence --> verb_phrase(sing, GEN), noun_phrase(GEN), noun_phrase. 

The above rule says that in an Arabic verb-initial sentence the verb has to be in 
the singular, and it forces gender agreement between the verb and the lexical subject. 

Unification-based grammars (Shieber 1986) represent an extension of phrase 
structure grammar. Unification grammar presents a grammatical model that relies 
heavily on feature unification. Several monostratal formalisms, such as lexical 
functional grammar and head-driven phrase structure grammar (HPSG), incorporated 
unification in their computational models of natural language. The progress in 
linguistic formal descriptions of natural languages and the availability of advanced 
dedicated computational technology for NLP provided a stimulating environment 
that promoted what is known now as “deep parsing” and rule-based NLP. The 1990s 
witnessed a paradigm shift in NLP with the rise of statistical approaches in NLP 
after their success in speech recognition systems. Rule-based approaches have been 
heavily used in machine translation for many years, however, particularly in indus- 
trial applications. 
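
The core operation of these formalisms, feature unification, can be sketched over plain dictionaries. This is a simplified illustration: the function name `unify` and the feature names are hypothetical, and the sketch omits the variables and shared substructures that full unification grammars rely on.

```python
def unify(a: dict, b: dict):
    """Return the merged feature structure, or None if any feature clashes."""
    result = dict(a)
    for feature, value in b.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            sub = unify(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            return None
    return result


verb = {"num": "sing", "gen": "masc"}

# Only gender is shared between a sentence-initial verb and its subject.
print(unify(verb, {"gen": "masc"}))  # compatible: features are merged
print(unify(verb, {"gen": "fem"}))   # clash: unification fails
```

Agreement then falls out of a single general operation rather than a separate rule for each combination of feature values.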

SYSTRAN Arabic-to-English 
Transfer Machine Translation System 

SYSTRAN, Inc., has been a pioneer in rule-based transfer machine translation for 
more than thirty years, focusing on developing machine translation systems for more 
than thirty languages using the transfer approach. The transfer approach to machine 
translation has three distinct stages: analysis of the source language; transfer of the 
structure of the source language to that of the target language; and the generation 
stage, which produces the target language. SYSTRAN is also recognized for its use 
of extensive dictionaries that annotate lexical items with morphological, syntactic, 
and semantic features. Because transfer machine translation systems are usually de- 
signed for specific language pairs, they can capitalize on the similarities between the 
source and target languages. They also use more sophisticated linguistic knowledge 
than that used in the direct method. 

The development of the SYSTRAN Arabic-to-English machine translation sys- 
tem began in San Diego in June 2002, initially with a small grant from the US gov- 
ernment. The author managed the project under the supervision of Jean Senellart, the 
director of research and development at SYSTRAN. The following subsection 
describes the development of this rule-based Arabic machine translation system, 
beginning with the first phase, an Arabic gisting machine translation system. 

The Gisting Phase 

The Arabic translation funding agencies had an urgent need for a gisting system that 
uses unstructured, unvocalized Arabic documents as input and generates a word-for- 
word English translation, making it possible for someone with no knowledge of 
Arabic to intelligently guess the subject of the original Arabic document. This can 
be very valuable, especially when the user is faced with enormous amounts of Arabic 
texts; and, clearly, sorting potentially relevant from irrelevant documents saves both 
time and money. Further, the system was required to be fast and to have 
coverage of no less than 95 percent. 

To meet these stringent requirements, SYSTRAN developed a monolingual 
Arabic stem-based lexicon and a bilingual Arabic-to-English dictionary. To expedite 
the process, it was decided to use Arabic stems rather than roots, which would elim- 
inate the step of generating stems from roots. Thus, each lemma is associated with a 
set of stems. For example, a lemma of an Arabic verb is associated with five stems: 
the perfect, imperfect, imperative, passive perfect, and passive imperfect. Lexicog- 
raphers were provided with the output of a guesser that generated all the required 
stems, with the additional requirement that the output of the guesser had to be vali- 
dated and corrected. A morphological generator was also developed to generate all 
the inflected forms of the lemmas in the dictionary. With these components, coverage 
at the end of the first three months of the project was 80 percent. Continuous testing 
on the Al Jazeera website and Arabic newspapers plus entering new words in the 
dictionary increased coverage to 96 percent by December 2002. 
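
The text does not say how coverage was measured. One plausible reading, assumed here, is the share of running tokens in a test text that are found in the runtime dictionary; the function name `coverage` is hypothetical.

```python
def coverage(tokens, runtime_dictionary):
    """Percentage of running tokens that are found in the dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in runtime_dictionary)
    return 100.0 * known / len(tokens)


# Toy check: 19 of 20 tokens are known.
sample = ["known"] * 19 + ["unknown"]
print(coverage(sample, {"known"}))
```

Under this reading, raising coverage means adding the unknown tokens from each test run to the dictionary and measuring again.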

Internal and External Morphology 

The traditional Arab grammarians’ account of Arabic morphology in terms of roots 
and patterns is very precise and explicit, and since the 1980s, there has been extensive 
research on computational treatments of Arabic morphology (Yehia 1985; Geith 
1985; Beesley 2001; Attia 2005; Soudi, van den Bosch, and Neumann 2007). Most 
work on Arabic morphology aims to identify and separate the prefixes and suffixes 
from the surface word and recover the root or the stem that may have undergone 
morphophonemic changes. But this is not a trivial problem for a computer program 
to solve. SYSTRAN made a fundamental distinction between two kinds of affixes 
that can be attached to Arabic stems and/or roots. The first type is the affix that has 
only a grammatical meaning, such as subject-verb agreement markers, and tense or 
mood markers. These affixes are not part of the SYSTRAN dictionary but are generated
by SYSTRAN’s Arabic morphological generator, which takes as input the list
of stems and their part-of-speech tags from the dictionary and generates all the surface
forms that each stem could assume. The result is a runtime dictionary that contains words
as they actually occur in authentic unstructured Arabic texts. Examples of these affixes
are the regular masculine plural markers ون and ين and the regular feminine plural
marker ات. In SYSTRAN’s system, internal morphology is pivotal: It is where all the
different forms of one and only one stem are generated.
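
A minimal sketch of such a generator follows. It is not SYSTRAN's implementation: the affix tables, tag names, and the tiny lexicon are assumptions chosen only to show how a list of tagged stems can be expanded into a runtime dictionary of surface forms.

```python
from collections import defaultdict

# Hypothetical affix tables; a production generator would be far richer
# and would handle morphophonemic changes, which this sketch ignores.
SUFFIXES = {
    "noun": ["", "ون", "ين", "ات"],
    "verb_perfect": ["", "ت", "وا", "نا"],
}
PREFIXES = {
    "verb_imperfect": ["ي", "ت", "ن", "أ"],
}


def generate_runtime_dictionary(lexicon):
    """Map each generated surface form to the (stem, tag) pairs behind it."""
    runtime = defaultdict(set)
    for stem, tag in lexicon:
        for prefix in PREFIXES.get(tag, [""]):
            for suffix in SUFFIXES.get(tag, [""]):
                runtime[prefix + stem + suffix].add((stem, tag))
    return dict(runtime)


# Toy lexicon: one noun stem and two stems of the same verb lemma.
lexicon = [
    ("معلم", "noun"),
    ("كتب", "verb_perfect"),
    ("كتب", "verb_imperfect"),
]
runtime = generate_runtime_dictionary(lexicon)
print(len(runtime))  # number of distinct surface forms generated
```

Because the grammatical affixes are applied offline, analysis at run time reduces to a dictionary lookup on the surface form.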

But with regard to external morphology, Arabic is an agglutinative language.
Thus affixes representing different parts of speech can be joined with a
stem or a root to form a token that has a syntactic structure. For example, the token
بالمدينة “in the city” is a prepositional phrase that has a stem, مدينة “a city,” which is
a noun, and two prefixes; the first is ب “in,” which is a preposition, and the second
is ال “the,” which is the definite article. Thus, external morphology describes how
the affixes that represent different parts of speech are attached to Arabic stems and
how the order of attachment is rule governed. SYSTRAN’s Arabic external morphology
defines the syntax governing the agglutination of Arabic complex words;
Farghaly (2003) provides specific examples of the Arabic internal and external
morphological rules.
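
The rule-governed order of attachment can be sketched as an ordered sequence of prefix slots. The slot inventory (conjunction, then preposition, then article), the glosses, and the two-word stem list below are assumptions for illustration; they are not the rules given in Farghaly (2003), and orthographic adjustments such as the assimilation of the article after ل are ignored.

```python
# Proclitic slots in their assumed order of attachment (outermost first).
PROCLITIC_SLOTS = [
    ("CONJ", {"و": "and", "ف": "so"}),
    ("PREP", {"ب": "in", "ل": "for", "ك": "like"}),
    ("DET", {"ال": "the"}),
]

# Hypothetical stem lexicon.
STEMS = {"مدينة": ("NOUN", "city"), "كتاب": ("NOUN", "book")}


def segment(token):
    """Return every analysis of token as a list of (form, tag, gloss).

    A prefix may only be followed by prefixes from later slots, which is
    how the ordering constraint is enforced.
    """
    analyses = []

    def walk(rest, slot, parts):
        if rest in STEMS:
            tag, gloss = STEMS[rest]
            analyses.append(parts + [(rest, tag, gloss)])
        for i in range(slot, len(PROCLITIC_SLOTS)):
            tag, forms = PROCLITIC_SLOTS[i]
            for form, gloss in forms.items():
                if rest.startswith(form) and len(rest) > len(form):
                    walk(rest[len(form):], i + 1, parts + [(form, tag, gloss)])

    walk(token, 0, [])
    return analyses


for form, tag, gloss in segment("بالمدينة")[0]:
    print(tag, gloss)
```

A token in which the article precedes the preposition receives no analysis, because no later slot can host the preposition.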

Arabic Syntactic Analysis and Disambiguation 

The goal of the second phase of the SYSTRAN Arabic-to-English machine transla- 
tion system was to improve translation quality by introducing analysis, transfer, and 
disambiguation rules. Several rules for recognizing noun phrases and their boundar- 
ies were introduced with transfer rules to transform the Arabic noun phrase structure 
to the English structure. For example, a common Arabic noun phrase has the structure 
“Det Noun Det Adj,” as in الرجل الطويل “the tall man,” which is transferred into the
corresponding English structure “Det Adj Noun.”
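
A transfer rule of this kind can be sketched as a pattern match over part-of-speech tags. The tag names and the function below are illustrative assumptions, and the input is shown already glossed into English for readability.

```python
def transfer_noun_phrase(tagged):
    """Rewrite each 'Det Noun Det Adj' sequence as 'Det Adj Noun'.

    tagged is a list of (word, tag) pairs; everything that does not match
    the pattern is copied through unchanged.
    """
    out = []
    i = 0
    while i < len(tagged):
        window = [tag for _, tag in tagged[i:i + 4]]
        if window == ["DET", "NOUN", "DET", "ADJ"]:
            det, noun, _second_det, adj = tagged[i:i + 4]
            out.extend([det, adj, noun])  # the second article is dropped
            i += 4
        else:
            out.append(tagged[i])
            i += 1
    return out


glossed = [("the", "DET"), ("man", "NOUN"), ("the", "DET"), ("tall", "ADJ")]
print(" ".join(word for word, _ in transfer_noun_phrase(glossed)))
# prints: the tall man
```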

Several analysis rules to recognize and correctly translate the Arabic genitive 
noun phrase known as the “idaafa” or “noun construct” were introduced. Similarly,
several rules for sentence structure were introduced to transfer the common Arabic 
VSO structure into the SVO English word order. The implementation of the analysis 
and transfer rules, though limited, resulted in a marked improvement in transla- 
tion quality. 

Another improvement was achieved through homograph resolution and word 
sense disambiguation. In dealing with homograph resolution, we found that the most
frequent homographic ambiguity was between noun and adjective (almost 90 percent of
the observed ambiguities). This high degree of noun-adjective ambiguity
arises from the structure of the Arabic language: Adjectives
and nouns inflect in the same way. Like nouns, Arabic adjectives inflect
for gender, number, case, and definiteness. It is not surprising, then, that traditional
Arabic grammarians subsume adjectives under nouns and consider that there are only 
three main parts of speech in Arabic: nouns, verbs, and particles. SYSTRAN imple- 
mented contextual rules for homograph resolution. For example, a noun/verb ambi- 
guity is resolved as a noun if the ambiguous word or phrase is preceded by a 
preposition because Arabic does not allow prepositions to directly precede verbs. 
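
The preposition rule can be sketched as follows. The data layout (each token paired with its set of candidate tags), the tag names, and the transliterated words are assumptions made for the example; the actual rule formalism used by SYSTRAN is not described here.

```python
def resolve_homographs(tokens):
    """Resolve noun/verb homographs that directly follow a preposition.

    tokens is a list of (word, candidate_tags) pairs. Unresolved
    ambiguities are reported with their candidates joined by '|'.
    """
    resolved = []
    for word, candidates in tokens:
        previous_tag = resolved[-1][1] if resolved else None
        if candidates == {"NOUN", "VERB"} and previous_tag == "PREP":
            tag = "NOUN"  # a preposition cannot directly precede a verb
        else:
            tag = "|".join(sorted(candidates))
        resolved.append((word, tag))
    return resolved


# Hypothetical input: "fy" is a preposition; "ktb" is a noun/verb homograph.
phrase = [("fy", {"PREP"}), ("ktb", {"NOUN", "VERB"})]
print(resolve_homographs(phrase))
# prints: [('fy', 'PREP'), ('ktb', 'NOUN')]
```

Without the preceding preposition the same word stays ambiguous, which is why rules of this type are described as contextual.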

SYSTRAN also implemented contextual rules with look-ahead and look-back 
features for word sense disambiguation. For example, in the absence of diacritization, 
the Arabic verb يزور could be translated as “visit” or “forge.” The word sense
disambiguation module would look ahead to see if it finds a noun with the feature “PLACE.”
If so, the preferred translation would be “visit” rather than “forge,” because in real
life one does not “forge a place” but is quite likely to visit one. The word
sense disambiguation module improved the quality of translation significantly.
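
A look-ahead rule of this kind can be sketched as below. The lexicon entries, the feature names other than “PLACE,” the window size, and the assumption that the unvocalized verb is written يزور are all illustrative and are not taken from the SYSTRAN system.

```python
# Hypothetical lexical resources.
AMBIGUOUS_VERBS = {"يزور": ["visit", "forge"]}
PREFERRED_SENSE = {("يزور", "PLACE"): "visit"}
NOUN_FEATURES = {"بغداد": {"PLACE"}, "وثيقة": {"DOCUMENT"}}


def disambiguate(tokens, index, window=3):
    """Return the candidate senses of tokens[index], narrowed by look-ahead.

    The next `window` tokens are scanned for a noun whose semantic feature
    selects one sense; if none is found, all senses are returned.
    """
    verb = tokens[index]
    senses = AMBIGUOUS_VERBS.get(verb, [])
    for following in tokens[index + 1:index + 1 + window]:
        for feature in sorted(NOUN_FEATURES.get(following, ())):
            preferred = PREFERRED_SENSE.get((verb, feature))
            if preferred is not None:
                return [preferred]
    return list(senses)


print(disambiguate(["يزور", "بغداد"], 0))  # prints: ['visit']
```

A look-back feature would scan the tokens before the verb in the same way.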

The output of a rule-based machine translation system still suffers from a lack
of fluency, however. A typical example of this lack of fluency is shown in figure 3.1,
which displays the output of SYSTRAN’s Arabic-to-English machine translation
(available online at www.systranet.com) of the following Arabic sentence:

(19a) LkljLtj 

(b) Israel calls her citizens for departure of Sinai. 




Figure 3.1 SYSTRAN’s Translation of a Headline on the Al Jazeera Website, November 12, 2010

There are several problems with this translation. First, although the system
recognizes that the word at issue in this sentence is a noun and not a verb, it fails to
distinguish between the use of that word as an entity of type “organization” and its use
as a verbal noun. Second, the verb in the English translation is singular, whereas
the subject is actually plural in English and dual in the source sentence. Third, the
verb in the target language is in the wrong position.

Problems with Symbolic NLP 

Although symbolic Arabic-to-English machine translation showed significant suc- 
cess, the problems it suffered are shared with other symbolic NLP systems: 

1. From toy grammars to wide coverage: Symbolic NLP systems usually show
impressive results as prototypes or toy grammars, but once they
go beyond prototyping, performance deteriorates. This is usually attributed to
rule interaction. When the system encompasses a large number of rules, newly
introduced rules interact with existing ones, which results in unexpected
behavior by the system.

2. Explosion of ambiguity: Although speakers usually do not have problems with
ambiguity, NLP systems have a very serious problem with it. This is
because humans have access to a huge amount of information that is not available
to NLP systems. First, native speakers in particular have perfect knowledge
of their language, whereas grammars developed by linguists have never
been complete. Second, humans have knowledge of the world and of the context,
as well as commonsense knowledge. Until it becomes possible to characterize
such knowledge, make it explicit, and encode it in a format accessible to computers,
ambiguity will always be a hindrance to developing high-quality symbolic
NLP systems.

3. Three hundred parses per sentence: Rule-based NLP systems have to consider 
a huge number of logically possible analyses. Thus the number of parses pro- 
duced by symbolic NLP systems becomes very large. The HPSG parser of the 
Lingo grammar produces an average of three hundred parses per sentence 
(Flickinger 2008). Considering all these parses and selecting the most appropriate
one takes time, which slows the system.

4. Obsession with theoretical correctness: Many computational linguists are con- 
cerned with validating the linguistic theory to which they adhere more than 
improving the performance of the applications they are developing. This has 
led to skepticism about the relevance of linguists to the development of NLP 
applications. 

5. Grammar engineering is expensive: Going beyond toy grammars and develop- 
ing real-life grammars capable of processing unrestricted data proved to be 
very time consuming. Furthermore, it is difficult to get good results. The more 
powerful the grammar becomes, the more rule interactions are encountered by 
linguists. What is fixed in one part of the grammar could have an
adverse effect on another part. Another problem is that when the grammar
becomes large, different teams need to work on it. Clear communication be- 
comes extremely important when different teams develop different parts of 
the grammar. 

6. The new data (blogs, emails, chats) are full of errors: Rule-based applications 
could work well with clean data that have fewer ungrammatical sequences. 
Most written texts that were processed by NLP systems in the 1970s and 
1980s were very well written. However, contributors to the digital data on the
Internet seldom take the time to correct their spelling or grammar. Thus,
current NLP applications must deal with what is known as “noisy data.” Although
noisy data present problems for both rule-based and statistical NLP, they
are easier to deal with in data-driven systems.

7. The need for contextual information and real-world knowledge: Symbolic 
NLP attempts to mimic the way humans process natural language when they 
communicate with each other. Humans not only make use of their linguistic
competence (Chomsky 1965), but they also make use of a vast amount of real-life
and commonsense knowledge. So far, symbolic NLP makes use of linguistic
knowledge, such as morphological analysis and shallow and deep parsing.
However, whereas humans have access to a vast amount of real-world and
commonsense knowledge, the knowledge that symbolic NLP systems have is
far more limited than that possessed by humans. It has been very difficult to
characterize, formalize, and encode real-world knowledge in a way that computers
can use. Until this becomes possible, symbolic NLP systems’ performance
will not rise to the level of humans.

8. Symbolic NLP systems perform well in very restricted domains: For example, 
the most accurate symbolic machine translation system has been the METEO 
system (Nirenburg 1992), which was developed in the late 1970s to translate 
the weather forecast in Canada. The high-quality output of the METEO sys- 
tem is attributed to the very restricted domain of the language to be translated. 
Narrowing the domain of the source language in a machine translation system 
usually results in fewer ambiguities, which in turn leads to an improvement in 
the quality of translation. 

9. Unrealistic promises and expectations: Developers of rule-based machine 
translation systems used to promise fully automatic translation with 95 percent 
accuracy, but this target has never been achieved. Customers, for their part, had
unrealistic expectations, and the result was frustration on both sides. The advent
of the Internet in the 1990s and the explosion of electronic data led to the
realization that human resources were not available to process this amount of 
data. Many governments and organizations had to be satisfied with less than 
perfect translation produced by the machine. 

The Statistical Approach to Machine Translation 

The statistical machine translation approach is based on finding the most probable 
translation of a sentence using data gathered from an aligned bilingual corpus. Sta- 
tistical machine translation has been gaining momentum in the last few years, and 
several factors make improving these systems faster and easier. First, monolingual 
and bilingual data on the web are growing, which means that there are enough data 
for language modeling and bilingual text alignment. Second, making these systems 
freely available on the web (e.g., at www.google.com) provides valuable crowdsourcing
feedback to the systems. Third, academic research on these systems has
grown and has already resulted in marked improvement. Fourth, current systems do 
not suffer from a lack of fluency in the output, which continues to be a problem for 
rule-based machine translation. Therefore, more and more users prefer statistical to 
rule-based systems. 

The Language Weaver Arabic-to-English Machine Translation System

Kevin Knight and Daniel Marcu of the University of Southern California founded
Language Weaver Inc. in January 2002. The goal was to apply their pioneering
research in statistical NLP to the commercial objective of producing useful automated 
machine translation systems. Fraser and Wong (2010) describe one of the very first 
products that came from this remarkable transfer of academic research to industry: 
a complete statistical Arabic-to-English machine translation system. This is an excel- 
lent example of how rapidly and inexpensively a statistical machine translation sys- 
tem can be built when parallel corpora and training data are available. 

The technology behind statistical machine translation is as follows (Fraser and
Wong 2010). The first step in building any statistical machine translation system is 
to compile a very large bilingual corpus that consists of texts in the source language 
that are translated by human experts into the target language. The second step is to 
perform a process of aligning the source language expressions with their target lan- 
guage equivalents. Alignment is usually done at the word level and/or the phrase 
level. Most words and even some phrases will be aligned with more than one target 
language equivalent. Consider the following possible translations of the Arabic word
يد “hand”:

(20a) يدي “my hand”

(b) A-^ A “iron fist” 

(c) يد الله “God’s support”

(d) As V l5As “do it myself rather than someone else does it to me” 

(e) اليد العليا “the upper hand, the giver”

(f) اليد السفلى “the taker”

(g) a “miser, mean” 

A translation model is created by estimating the probability of each of the pos- 
sible alignments of a word or a phrase. The probability of a particular translation 
would depend in part on its frequency in the training set and the degree of similarity 
in the domain of the training set to that of the testing data. The target language sen- 
tences in the bilingual corpus are usually augmented with more data from the target 
language to create a language model. The function of the language model is to help 
select the best possible translation of a phrase, which could be a group of words or 
just one word, and to decide which phrases would work together so that the output 
is more fluent. For example, let us assume we have the following Arabic input: 

(21) أنا سعيد جدا

The translation model produces the phrases “very happy” and “I.” All possible
permutations of these phrases are considered. The language model assigns a
probabilistic score to each one, which serves as an estimate of its grammaticality. For
example, neither “I very happy” nor “very happy I” is a grammatical English sentence.
However, “I am very happy” is a well-formed English sentence that is close to the
suggested translations produced by the translation model. So it receives the highest
score.
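
The interplay of the two models can be sketched in a few lines. This is an illustration, not the Language Weaver implementation: it assumes a relative-frequency estimate for the translation model, a bigram language model with add-one smoothing, and a per-bigram average so that longer candidates are not penalized for their length. The aligned pairs, the four-sentence English corpus, and the candidate list are invented toy data.

```python
import math
from collections import Counter, defaultdict


def estimate_translation_model(aligned_pairs):
    """Relative-frequency estimate of P(target phrase | source phrase)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(source for source, _ in aligned_pairs)
    model = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        model[source][target] = count / source_counts[source]
    return dict(model)


class BigramLanguageModel:
    """Bigram model with add-one smoothing over a small training corpus."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocabulary = {"<s>", "</s>"}
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.vocabulary.update(words)
            for left, right in zip(words, words[1:]):
                self.bigrams[(left, right)] += 1
                self.contexts[left] += 1

    def score(self, sentence):
        """Average smoothed log probability per bigram (higher is better)."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        size = len(self.vocabulary)
        total = 0.0
        for left, right in zip(words, words[1:]):
            numerator = self.bigrams[(left, right)] + 1
            total += math.log(numerator / (self.contexts[left] + size))
        return total / (len(words) - 1)


# Toy aligned data: the same source word aligned with two target words.
aligned_pairs = [("يد", "hand")] * 3 + [("يد", "support")]
translation_model = estimate_translation_model(aligned_pairs)

# Toy monolingual target-language corpus.
language_model = BigramLanguageModel([
    "i am very happy",
    "i am here",
    "she is very happy",
    "we are very happy today",
])

candidates = ["i very happy", "very happy i", "i am very happy"]
best = max(candidates, key=language_model.score)
print(translation_model["يد"])
print(best)
```

With these toy data the well-formed candidate obtains the best average score. Without the length normalization the shorter, ungrammatical “i very happy” would win, which is why the normalization is stated above as an explicit assumption.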

There are a number of possibilities for improving the Language Weaver Arabic
machine translation system: for example, incorporating statistically based
syntactic analysis, more sophisticated morphology, and a special module
for treating transliterated names of persons and companies, as well as delivering a
“learning” module that the customer could use after postediting the translation output.
It would be an excellent addition to the current system to empower users with features
that would allow them to modify the translation engine to better serve their specific
domain and to correct observed translation inaccuracies. It should be noted that
Language Weaver was acquired in 2010 by SDL, a provider of services and
software for language translation and content management.

The AppTek Hybrid Machine Translation System

Sawaf (2010) describes a hybrid machine translation system for translating written 
and spoken MSA texts as well as Iraqi Arabic, and he also reviews the two main 
approaches to machine translation: statistical and rule-based. After carefully evaluat- 
ing the advantages and disadvantages of each approach, he presents the AppTek 
machine translation system as an embodiment of the positive features of both 
approaches. 

Sawaf identifies three innovations in the machine translation system.
First, the translation of entities such as personal names and dates using a named
entity recognition component improves the quality of the translation. Once a named
entity is recognized, the system uses several approaches to translate it.
For example, a token such as أمل, when recognized as a personal name, would not
be translated into its linguistic meaning “hope.” Rather, it would be transcribed into
its phonetic representation in the target language, /Amel/. Second, the system
incorporates Arabic dialects. For example, the current system translates the Iraqi dialect
into MSA using a bilingual corpus, linguistic features of both MSA and the Iraqi
dialect, and data training sets. Third, it combines Arabic speech recognition output
with machine learning. The speech recognition engine was trained using corpora in
MSA and Iraqi dialects. AppTek was acquired in 2010 by Science Applications
International Corporation (SAIC). This acquisition makes SAIC the owner of a complete
set of products for speech and text processing, including machine translation,
knowledge management, and speech recognition for more than thirty languages.
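
Returning to the first of these innovations, the effect of named entity recognition on lexical choice can be sketched as a simple dispatch. The two lookup tables and the boolean flag below are assumptions standing in for a bilingual dictionary, a transcription component, and the output of a named entity recognizer; they do not describe AppTek's actual components.

```python
# Hypothetical resources for a single token.
BILINGUAL_DICTIONARY = {"أمل": "hope"}
NAME_TRANSCRIPTIONS = {"أمل": "Amel"}


def render(token, is_personal_name):
    """Transcribe recognized names; translate everything else.

    Tokens missing from the relevant table are passed through unchanged.
    """
    if is_personal_name:
        return NAME_TRANSCRIPTIONS.get(token, token)
    return BILINGUAL_DICTIONARY.get(token, token)


print(render("أمل", is_personal_name=True))   # prints: Amel
print(render("أمل", is_personal_name=False))  # prints: hope
```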

Properties of Statistical NLP 

Most statistical machine translation systems exhibit one or more of the following 
properties that are common to statistical NLP systems: 

1. We can “learn” linguistic knowledge from the data. Whereas rule-based machine
translation systems rely exclusively on explicit linguistic knowledge of
the source and target languages, statistical systems assume that the required
linguistic knowledge can be “learned” from the data when the data are sufficient.
Sophisticated algorithms have been developed to “learn” syntactic rules
for machine translation systems from aligned corpora. In a very interesting paper,
Galley and his colleagues (2004) propose such an algorithm.

2. Statistical NLP has achieved great success in developing speech recognition 
systems and continues to be the dominant approach in all speech applications. 

3. Statistical NLP is based on probability theory and the use of language model- 
ing. For example, when there is more than one possible translation of a phrase 
in the source language, the translation that occurs more frequently in the target 
language model is selected. 

4. Whereas symbolic NLP regards language as a cognitive system, statistical
NLP adopts an empirical approach by applying statistical, pattern recognition,
and machine learning techniques.

5. Symbolic NLP does not require training data, although the availability of corpora
of the source and target languages can be very useful. It therefore works
well with less commonly studied languages for which there are no existing
training data. In contrast, statistical machine translation systems rely on the
existence of parallel corpora of the source and target languages. The quality of
statistical machine translation systems depends on the similarity between the
training data and the actual data that will be the input.

6. The extraction of grammar in a statistical NLP system comes as a result of a
learning phase, when the system is usually fed training data. When the training
data are annotated, the learning is supervised.

7. Statistical NLP systems usually go through an iterative cycle: a training
stage, in which the system works on seen data; a deployment
stage, in which the system is tested on unseen data; and, finally, an analysis
of the results. This process may be repeated several times until the results
become satisfactory.

8. The translations of the symbolic machine translation systems are put together 
by the machine, whereas the output of the statistical systems is very fluent be- 
cause the translation in parallel corpora is made by expert human translators. 

9. Recently, more and more machine learning researchers report improvement in 
performance when incorporating linguistic knowledge (Zitouni, Luo, and
Florian 2010; Sawaf 2010).



Summary and Conclusion 

In this chapter I have traced the beginnings of machine translation from the vision 
that Descartes had four hundred years ago to the first realization of his dream in the 
twentieth century. Since then, machine translation technology has evolved through 
at least three generations — starting from the direct method, which was followed by 
the transfer approach, which was succeeded by the statistical machine translation 
approach. I have briefly described three Arabic machine translation systems,
each of which was developed following one of these approaches. Thus, Arabic
machine translation has been part and parcel of mainstream machine translation and, 
as such, it has undergone the same development paradigms as mainstream machine 
translation. 